Case Study · February 2026

What Happens When You Stop Trusting One Model With Your Code

We pointed CRTX at a 265,000-line production codebase. Four models collaborated. Two providers went down. The pipeline completed anyway.

Every AI coding tool today works the same way: you send a prompt to one model, you get code back, you hope it's good. If the model has a blindspot — a security pattern it consistently misses, an architectural decision it always makes the same way — you inherit that blindspot every single time.

We built CRTX to fix this.

CRTX is an open-source orchestration engine that routes your coding task through multiple specialized stages — each handled by the best model for the job — with an independent Arbiter that reviews every output. The Arbiter uses a different model from the one that did the work. It can approve, flag warnings, reject with retry instructions, or halt the pipeline entirely.

The core insight is simple: no model should grade its own work.
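
In code terms, every stage runs through a gate. The sketch below is not CRTX's actual API (the class names and the worker/reviewer objects are invented for illustration), but it captures the approve/flag/reject/halt flow, with a reviewer model that is never the model that produced the work.

from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    FLAG = "flag"      # accept, but record warnings
    REJECT = "reject"  # send back with retry instructions
    HALT = "halt"      # stop the pipeline entirely


@dataclass
class Review:
    verdict: Verdict
    confidence: float  # 0.0 to 1.0
    notes: str         # warnings, or retry instructions on a reject


def gated_stage(task, worker, reviewer, max_retries=2):
    """Run one stage, then have a *different* model review the output."""
    feedback = None
    for _ in range(max_retries + 1):
        output = worker.run(task, feedback=feedback)
        review = reviewer.review(task, output)
        if review.verdict in (Verdict.APPROVE, Verdict.FLAG):
            return output, review
        if review.verdict is Verdict.HALT:
            raise RuntimeError(f"Arbiter halted the pipeline: {review.notes}")
        feedback = review.notes  # REJECT: retry with the arbiter's instructions
    return output, review        # retries exhausted; surface the last rejection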

We've been using CRTX internally on a financial services platform with 2,930+ Python tests and 85 JavaScript tests. Here's what happened across two real pipeline runs.

Test 1: When Two Providers Go Down at Once

The task: "Create a Python function that validates email addresses using regex, with comprehensive pytest tests."

Mode: Sequential. Routing: Hybrid. Arbiter: Bookend.

What happened next wasn't planned. Two providers went down simultaneously during the pipeline run.

Gemini 2.5 Pro hit a known Google quota bug — a 429 rate limit stuck on the free tier despite Paid Tier 1 credentials. Claude Opus 4.6 had a temporary 529 overloaded outage (a confirmed Anthropic incident on Feb 16-17). In a single-model tool, the job would have failed. In CRTX, the fallback chain kicked in automatically:

Fallback chain:
  Architect:  Gemini 2.5 Pro (0.90) → o3 (0.88)
  Refactor:   Claude Opus 4.6 (0.95) → Claude Sonnet 4.5 (0.90)
  Verify:     Claude Opus 4.6 (0.90) → o3 (0.90)
  Arbiter:    Claude Opus → Claude Sonnet (all reviews)
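
The fallback mechanics themselves are easy to picture: each stage keeps an ordered list of providers and takes the first one that answers. A rough sketch of that pattern (call_model and ProviderError are placeholders, not CRTX internals):

FALLBACKS = {
    "architect": ["gemini-2.5-pro", "o3"],
    "refactor":  ["claude-opus-4.6", "claude-sonnet-4.5"],
    "verify":    ["claude-opus-4.6", "o3"],
}


class ProviderError(Exception):
    """Stand-in for 429 rate limits, 529 overloads, timeouts, and so on."""


def run_stage(stage: str, prompt: str) -> str:
    errors = []
    for model in FALLBACKS[stage]:
        try:
            return call_model(model, prompt)   # placeholder provider call
        except ProviderError as exc:
            errors.append(f"{model}: {exc}")   # note the failure, try the next model
    raise RuntimeError(f"all providers failed for {stage}: {errors}")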

The pipeline completed. But the interesting part wasn't the fallback — it was what the Arbiter did next.

The Sonnet arbiter reviewed o3's verify output and rejected it. Then rejected the retry. Then rejected it again, with increasing confidence:

  FLAG    (0.87)  Architect output — minor structural concerns
  REJECT  (0.92)  Verify output — relative imports incompatible with flat directory
  REJECT  (0.98)  Retry 1 — same structural issue persists
  REJECT  (1.00)  Retry 2 — import/directory mismatch unresolved

The arbiter was right. It correctly identified that the code used relative imports but the files were laid out in a flat directory structure. A single model reviewing its own output would never catch this — the same reasoning that produced the mistake would review it.
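
The class of mistake is easy to reproduce. A relative import only resolves inside a package, so with hypothetical file names the generated layout fails like this:

# Flat layout, no package, no __init__.py (file names are illustrative):
#   email_validator.py
#   test_email_validator.py

# A relative import needs a parent package, so this line raises
# "ImportError: attempted relative import with no known parent package":
from .email_validator import validate_email

# In a flat directory the import has to be absolute:
from email_validator import validate_email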

Run stats: $0.56 cost · 24,812 tokens · 601.9s · 3 fallbacks

Test 2: 265,000 Lines, 20 Test Files, $3

The task: "Analyze the codebase and identify modules with insufficient test coverage. For each gap found, generate comprehensive pytest tests following the existing test patterns and conventions."

We pointed CRTX at the full production codebase — 1,214 Python files, 265,371 lines of code — with context injection and a 12K token budget. Mode: Sequential. Routing: Hybrid. Arbiter: Bookend.
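
Context injection here means deciding which slices of a 265,000-line codebase actually fit into the prompt. A minimal sketch of the idea, assuming a greedy pack of the highest-ranked files into a token budget (the relevance scoring and the 4-characters-per-token estimate are stand-ins, not CRTX's implementation):

from pathlib import Path


def estimate_tokens(text: str) -> int:
    return len(text) // 4          # crude heuristic: ~4 characters per token


def build_context(repo_root: str, rank, budget: int = 12_000) -> str:
    """Greedily pack the highest-ranked files into a token budget.

    rank(path) stands in for whatever relevance signal the task uses
    (import graph, path match, embedding similarity, ...).
    """
    files = sorted(Path(repo_root).rglob("*.py"), key=rank, reverse=True)
    chunks, used = [], 0
    for path in files:
        text = path.read_text(errors="ignore")
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue               # too big for what's left; try smaller files
        chunks.append(f"# --- {path} ---\n{text}")
        used += cost
    return "\n\n".join(chunks)

With a 12K budget spread across 1,214 files, most of what fits is signatures rather than full implementations, which is exactly the limitation the last lesson below comes back to.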

Models used:
  Architect:  o3 (fallback from Gemini)
  Implement:  GPT-4o
  Refactor:   Claude Opus 4.6
  Verify:     Claude Opus 4.6

The output was substantial: 20 targeted test files covering core services, configuration layers, data schemas, notification handlers, and shared domain models, plus a shared conftest.py with SQLite fixtures.

Quality check: the generated tests used correct import paths, correct function signatures, and correct parameter names. They included parametrized tests, boundary-value testing, idempotency checks, and multi-tenant isolation tests. A developer could drop these into the test suite and have meaningful coverage in under an hour.
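
For a concrete sense of what that output looks like, here is an illustrative reconstruction (not a verbatim file from the run; table, fixture, and module names are invented), with the conftest fixture and one test module shown in a single listing:

# conftest.py -- shared fixture: an in-memory SQLite database keeps tests hermetic
import sqlite3
import pytest


@pytest.fixture
def db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (tenant TEXT NOT NULL, name TEXT NOT NULL)")
    yield conn
    conn.close()


# test_items.py -- parametrized boundary values plus a tenant-isolation check
@pytest.mark.parametrize("name,valid", [
    ("widget", True),       # normal case
    ("", False),            # boundary: empty name
    ("x" * 256, False),     # boundary: over an assumed 255-character limit
])
def test_name_validation(name, valid):
    assert (0 < len(name) <= 255) == valid


def test_tenant_isolation(db):
    db.execute("INSERT INTO items VALUES ('tenant_a', 'widget')")
    rows = db.execute(
        "SELECT name FROM items WHERE tenant = ?", ("tenant_b",)
    ).fetchall()
    assert rows == []       # tenant_b must never see tenant_a's data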

But the most valuable thing the pipeline produced wasn't the test files — it was a bug in CRTX itself.

The Opus verify stage returned a confidence score of 0.08. It wasn't being pessimistic. It genuinely couldn't see the code it was supposed to verify, because the refactor→verify handoff wasn't passing accumulated code. That rock-bottom score was a direct signal that the pipeline had an internal bug.

We found and fixed it in the same session (commit 3295537).
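
The bug class is worth spelling out: if each stage forwards only its own delta instead of everything produced so far, a downstream verifier has nothing concrete to check and its confidence collapses. A hedged illustration of the shape of the fix (invented names, not the actual commit):

from dataclasses import dataclass, field


@dataclass
class StageResult:
    stage: str
    files: dict = field(default_factory=dict)   # path -> source produced so far


def hand_off(prev: StageResult, stage: str, new_files: dict) -> StageResult:
    # Carry ALL accumulated code forward, not just this stage's delta;
    # passing new_files alone is the failure mode described above.
    return StageResult(stage=stage, files={**prev.files, **new_files})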

  FLAG     (0.82)  Architect — Sonnet flagged structural gaps
  REJECT   (0.50)  Verify — o3 rejected initial output
  APPROVE  (0.44)  Verify retry — o3 accepted with caveats

Run stats: $3.04 cost · 92,488 tokens · 638.2s · 20 test files

The Numbers

  Pipeline runs:         2
  Total cost:            $3.60
  Total tokens:          117,300
  Models used:           4
  Auto-fallbacks:        5
  Test files generated:  20
  Pipeline bugs caught:  1
  CRTX's own tests:      997

What We Learned

Cross-model review works. The Sonnet arbiter rejecting o3's verify output at increasing confidence levels — 0.92, 0.98, 1.00 — was correct. It found a structural mismatch between imports and file layout that o3 couldn't see in its own output. A single-model tool would never catch this because the same model that made the mistake would review it. The arbiter pattern isn't just a nice-to-have; it's the only architecture that catches systematic blindspots.

Auto-fallback is essential, not optional. Both Gemini and Claude Opus went down during testing — real outages, not simulated. The pipeline completed because fallback was automatic. Any production AI tool that hardcodes a single provider is fragile. We didn't plan to test resilience; the infrastructure forced us to, and CRTX passed.

The arbiter's confidence score is a meaningful signal. When Opus verify gave 0.08 confidence, it wasn't being pessimistic — it genuinely couldn't see the code it was supposed to verify. That low confidence led us to discover and fix a real handoff bug in CRTX's pipeline (commit 3295537). The confidence score isn't a vanity metric; it's a diagnostic tool.

Context injection quality determines output quality. The 12K token budget gave models file signatures but not implementations. The generated tests were structurally correct but couldn't assert against specific behavior they couldn't see. After bumping to 20K with full source for top files, output quality improved significantly. Budget tuning matters more than model selection for codebase-aware tasks.

Try It

CRTX is open source under Apache 2.0. The complete pipeline — all five modes, the Arbiter, smart routing, context injection, apply mode — runs locally with your own API keys.

pip install crtx

crtx run "Your coding task here" \
  --arbiter bookend \
  --route quality_first
View on GitHub · Read the docs
CRTX is built by TriadAI. It started as an internal tool to help us ship faster with AI; now it's open source so you can ship faster too.