Case Study · February 2026

How We Killed Our Own Pipeline and Built the Loop

We built a multi-model pipeline, benchmarked it, found it was making things worse. So we rebuilt from scratch.

BEFORE

The multi-model pipeline

Multiple models, multiple stages, $4.85 per run

CRTX v0.1 was a multi-model pipeline orchestrator. Architect designs the approach, Implementer writes the code, Refactorer cleans it up, Verifier validates. Five modes: sequential, parallel, debate, review, improve. Smart routing across models. An Arbiter that reviewed each stage.

It was architecturally interesting. We were proud of it. Then we benchmarked it.

We ran 12 coding prompts through every condition: single Sonnet, single o3, multi-model debate, and our sequential pipeline. We scored each output on correctness, completeness, code quality, and test coverage. Then we ran every output through a test runner to see what actually worked.

The results were brutal:

| Condition          | Avg Score | Min Score | Spread  | Cost  |
|--------------------|-----------|-----------|---------|-------|
| Single Sonnet      | 94%       | 92%       | ±4 pts  | $0.36 |
| Single o3          | 81%       | 54%       | ±41 pts | $0.44 |
| Multi-model Debate | 88%       | 75%       | ±25 pts | $5.59 |

Our multi-model debate mode — the one we'd built the entire product around — scored lower than a single Sonnet call at 15x the cost. The debate produced prose, not code. Models argued about architecture instead of writing working software. The Arbiter reviewed essays, not implementations.

The spread told the real story: ±25 points meant some prompts got great output and others got garbage. Inconsistency is worse than consistent mediocrity, because you can never trust any single output.

INSIGHT

The problem was never the model

We had been solving the wrong problem. We assumed that combining multiple models would produce better code. It didn't. What it produced was more prose, more opinions, and more cost.

The real problem was simpler: nobody was running the code. Every model produced output that looked correct to another model. But looking correct and being correct are different things. A broken import, a missing function argument, a test that doesn't actually test anything — these are invisible to model-based review but immediately obvious to a test runner.
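A toy illustration of the gap, with a function and call invented for this example: the code reads fine to a reviewer, but running it exposes the mismatch instantly.

```python
# Plausible-looking code a model reviewer would happily approve:
def total(items):
    return sum(item.price for item in items)

# But the callers pass dicts, not objects with a .price attribute.
# A reviewer sees "correct"; a test runner sees an AttributeError.
try:
    total([{"price": 3}, {"price": 4}])
except AttributeError as e:
    print("caught:", e)
```

Execution doesn't care how convincing the code looks; the mismatch surfaces on the first call.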

The fix wasn't more models. It was verification.

AFTER

The Loop

Generate → Test → Fix → Review. One model, verified output.

We rebuilt CRTX from the ground up around a different idea: generate code with the best model for the task, run it through a real test suite, feed failures back for targeted fixes, and then — only then — have an Arbiter review it.

The Loop uses one model per run, not four. Routing matches model to task complexity: simple tasks get fast models; complex tasks get the strongest model plus an up-front architecture debate. The key difference: every output gets tested before you see it.

The test runner checks five things: AST parse, import resolution, pyflakes static analysis, pytest execution, and entry point execution. If anything fails, structured errors feed back to the model for a targeted fix. If the fix cycle stalls, three escalation tiers activate: root cause diagnosis, minimal context retry, and a second opinion from a different model.
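A minimal sketch of the first two of those checks, built only on the standard library; the real runner also shells out to pyflakes and pytest and runs the entry point, which we only note in comments here. `check_source` is an illustrative name, not CRTX's API.

```python
import ast
import importlib.util

def check_source(source: str) -> list[str]:
    """Return structured errors for two verification stages:
    AST parse and top-level import resolution.
    (Stages 3-5 — pyflakes, pytest, entry point — would run
    as subprocesses against a file on disk.)"""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax: {e.msg} (line {e.lineno})"]
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            # Resolve against the current environment without importing.
            if importlib.util.find_spec(root) is None:
                errors.append(f"import: no module named '{root}'")
    return errors

print(check_source("import os\nimport not_a_real_module_xyz\n"))
```

Returning errors as a structured list, rather than a pass/fail bit, is what makes the targeted fix step possible: the model gets told exactly which check failed and why.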

After tests pass, the Arbiter reviews for logic issues, security gaps, and design problems that automated tests can't detect. If it rejects, one more fix cycle runs.
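The control flow described above can be sketched as follows. `generate`, `run_checks`, `fix`, and `arbiter_review` are placeholders for the real model calls and test runner, and `MAX_FIX_CYCLES` is an assumed limit; the escalation tiers are elided.

```python
MAX_FIX_CYCLES = 3  # assumed cap before escalation would kick in

def loop(prompt, generate, run_checks, fix, arbiter_review):
    code = generate(prompt)              # 1. Generate
    for _ in range(MAX_FIX_CYCLES):      # 2. Test ...
        errors = run_checks(code)
        if not errors:
            break
        code = fix(code, errors)         # 3. ... feed failures back for a fix
    verdict = arbiter_review(code)       # 4. Review, only after tests pass
    if verdict != "approve":
        code = fix(code, [verdict])      # one more fix cycle on rejection
    return code
```

The ordering is the whole point: the Arbiter never sees code that hasn't already survived the test runner.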

The benchmark tells the story

| Condition          | Avg Score | Min | Spread  | Dev Time | Cost  |
|--------------------|-----------|-----|---------|----------|-------|
| Single Sonnet      | 94%       | 92% | ±4 pts  | 10 min   | $0.36 |
| Single o3          | 81%       | 54% | ±41 pts | 4 min    | $0.44 |
| Multi-model Debate | 88%       | 75% | ±25 pts | 9 min    | $5.59 |
| CRTX Loop          | 99%       | 98% | ±2 pts  | 2 min    | $1.80 |

Dev Time measures estimated developer minutes to get the output to production — based on test failures, import errors, and entry point issues. The Loop's output needs 2 minutes of developer attention on average. Single Sonnet needs 10. The debate output needs 9, despite costing 15x more.
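As a rough illustration of how such an estimate could be derived, here is a sketch; the weights below are invented for this example and are not CRTX's actual formula.

```python
def dev_time_minutes(test_failures: int, import_errors: int,
                     entrypoint_broken: bool, base: int = 2) -> int:
    """Hypothetical dev-time heuristic: minutes a developer spends
    getting the output to production. All weights are illustrative."""
    minutes = base                   # reading and sanity-checking the output
    minutes += 3 * test_failures     # each failing test needs diagnosis + fix
    minutes += 2 * import_errors     # each unresolved import
    minutes += 5 if entrypoint_broken else 0  # broken entry point: bigger dig
    return minutes
```

Clean output bottoms out at the base cost, which is why a verified run lands near 2 minutes while unverified output climbs with every failure.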

Consistency is the real breakthrough. ±2 points of spread means every prompt gets near-perfect output. No more re-rolling the dice hoping for a good generation. The Loop's min score (98%) is higher than any other condition's average.

What we learned

1. Testing beats reviewing. Model-based review is unreliable. Models approve broken code. A test runner doesn't care how convincing the code looks — it either works or it doesn't.

2. One good model beats three mediocre passes. The sequential pipeline used four stages across different models. The Loop uses one model and verifies its output. Less complexity, better results.

3. Cost tracks waste, not quality. Our $5.59 debate runs scored lower than our $0.36 single-model runs. More money bought more prose, not more correctness.

4. The Arbiter still matters. But only after tests pass. The Arbiter catches things tests can't: logic issues, security vulnerabilities, design problems. Reviewing verified code is fundamentally different from reviewing unverified code.

Try it yourself

Run the benchmark with your own API keys and see the numbers firsthand:

pip install crtx

# Run the full benchmark
crtx benchmark --quick

# Or just try the Loop
crtx loop "Build a REST API with FastAPI, SQLite, search and pagination"
View on GitHub · Read the docs
CRTX is built by TriadAI. We killed our own product and rebuilt it when the data said we were wrong. That's the kind of honesty we think this space needs.