Case Study · February 2026

How We Killed Our Own Pipeline and Built the Loop

We built a multi-model pipeline, benchmarked it, found it was making things worse. So we rebuilt from scratch.

BEFORE

The multi-model pipeline

Multiple models, multiple stages, $4.85 per run

CRTX v0.1 was a multi-model pipeline orchestrator. Architect designs the approach, Implementer writes the code, Refactorer cleans it up, Verifier validates. Five modes: sequential, parallel, debate, review, improve. Smart routing across models. An Arbiter that reviewed each stage.

It was architecturally interesting. We were proud of it. Then we benchmarked it.

We ran 12 coding prompts through every condition: single Sonnet, single o3, multi-model debate, and our sequential pipeline. We scored each output on correctness, completeness, code quality, and test coverage. Then we ran every output through a test runner to see what actually worked.

The results were brutal:

| Condition          | Avg Score | Min Score | Spread  | Cost  |
|--------------------|-----------|-----------|---------|-------|
| Single Sonnet      | 94%       | 92%       | ±4 pts  | $0.36 |
| Single o3          | 81%       | 54%       | ±41 pts | $0.44 |
| Multi-model Debate | 88%       | 75%       | ±25 pts | $5.59 |

Our multi-model debate mode — the one we'd built the entire product around — scored lower than a single Sonnet call at 15x the cost. The debate produced prose, not code. Models argued about architecture instead of writing working software. The Arbiter reviewed essays, not implementations.

The spread told the real story: ±25 points meant some prompts got great output and others got garbage. Inconsistency is worse than consistent mediocrity, because you can never trust any single output.

INSIGHT

The problem was never the model

We had been solving the wrong problem. We assumed that combining multiple models would produce better code. It didn't. What it produced was more prose, more opinions, and more cost.

The real problem was simpler: nobody was running the code. Every model produced output that looked correct to another model. But looking correct and being correct are different things. A broken import, a missing function argument, a test that doesn't actually test anything — these are invisible to model-based review but immediately obvious to a test runner.
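A toy illustration of the gap, with a function and call invented for this example: the code reads fine to a reviewer, but running it exposes the mismatch instantly.

```python
# Plausible-looking code a model reviewer would happily approve:
def total(items):
    return sum(item.price for item in items)

# But the callers pass dicts, not objects with a .price attribute.
# A reviewer sees "correct"; a test runner sees an AttributeError.
try:
    total([{"price": 3}, {"price": 4}])
except AttributeError as e:
    print("caught:", e)
```

Execution doesn't care how convincing the code looks; the mismatch surfaces on the first call.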

The fix wasn't more models. It was verification.

AFTER

The Loop

Generate → Test → Fix → Review. One model, verified output.

We rebuilt CRTX from the ground up around a different idea: generate code with the best model for the task, run it through a real test suite, feed failures back for targeted fixes, and then — only then — have an Arbiter review it.

The Loop uses one model per run, not four. Routing matches model to task complexity: simple tasks get fast models; complex tasks get the strongest model plus an up-front architecture debate. The key difference: every output gets tested before you see it.

The test runner checks five things: AST parse, import resolution, pyflakes static analysis, pytest execution, and entry point execution. If anything fails, structured errors feed back to the model for a targeted fix. If the fix cycle stalls, three escalation tiers activate: root cause diagnosis, minimal context retry, and a second opinion from a different model.
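A minimal sketch of the first two of those checks, built only on the standard library; the real runner also shells out to pyflakes and pytest and runs the entry point, which we only note in comments here. `check_source` is an illustrative name, not CRTX's API.

```python
import ast
import importlib.util

def check_source(source: str) -> list[str]:
    """Return structured errors for two verification stages:
    AST parse and top-level import resolution.
    (Stages 3-5 — pyflakes, pytest, entry point — would run
    as subprocesses against a file on disk.)"""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax: {e.msg} (line {e.lineno})"]
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module] if node.module else []
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            # Resolve against the current environment without importing.
            if importlib.util.find_spec(root) is None:
                errors.append(f"import: no module named '{root}'")
    return errors

print(check_source("import os\nimport not_a_real_module_xyz\n"))
```

Returning errors as a structured list, rather than a pass/fail bit, is what makes the targeted fix step possible: the model gets told exactly which check failed and why.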

After tests pass, the Arbiter reviews for logic issues, security gaps, and design problems that automated tests can't detect. If it rejects, one more fix cycle runs.
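The control flow described above can be sketched as follows. `generate`, `run_checks`, `fix`, and `arbiter_review` are placeholders for the real model calls and test runner, and `MAX_FIX_CYCLES` is an assumed limit; the escalation tiers are elided.

```python
MAX_FIX_CYCLES = 3  # assumed cap before escalation would kick in

def loop(prompt, generate, run_checks, fix, arbiter_review):
    code = generate(prompt)              # 1. Generate
    for _ in range(MAX_FIX_CYCLES):      # 2. Test ...
        errors = run_checks(code)
        if not errors:
            break
        code = fix(code, errors)         # 3. ... feed failures back for a fix
    verdict = arbiter_review(code)       # 4. Review, only after tests pass
    if verdict != "approve":
        code = fix(code, [verdict])      # one more fix cycle on rejection
    return code
```

The ordering is the whole point: the Arbiter never sees code that hasn't already survived the test runner.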

The benchmark tells the story

| Condition          | Avg Score | Min | Spread  | Dev Time | Cost  |
|--------------------|-----------|-----|---------|----------|-------|
| Single Sonnet      | 94%       | 92% | ±4 pts  | 10 min   | $0.36 |
| Single o3          | 81%       | 54% | ±41 pts | 4 min    | $0.44 |
| Multi-model Debate | 88%       | 75% | ±25 pts | 9 min    | $5.59 |
| CRTX Loop          | 99%       | 98% | ±2 pts  | 2 min    | $1.80 |

Dev Time measures estimated developer minutes to get the output to production — based on test failures, import errors, and entry point issues. The Loop's output needs 2 minutes of developer attention on average. Single Sonnet needs 10. The debate output needs 9, despite costing 15x more.
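As a rough illustration of how such an estimate could be derived, here is a sketch; the weights below are invented for this example and are not CRTX's actual formula.

```python
def dev_time_minutes(test_failures: int, import_errors: int,
                     entrypoint_broken: bool, base: int = 2) -> int:
    """Hypothetical dev-time heuristic: minutes a developer spends
    getting the output to production. All weights are illustrative."""
    minutes = base                   # reading and sanity-checking the output
    minutes += 3 * test_failures     # each failing test needs diagnosis + fix
    minutes += 2 * import_errors     # each unresolved import
    minutes += 5 if entrypoint_broken else 0  # broken entry point: bigger dig
    return minutes
```

Clean output bottoms out at the base cost, which is why a verified run lands near 2 minutes while unverified output climbs with every failure.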

Consistency is the real breakthrough. ±2 points of spread means every prompt gets near-perfect output. No more re-rolling the dice hoping for a good generation. The Loop's min score (98%) is higher than any other condition's average.

What we learned

1. Testing beats reviewing. Model-based review is unreliable. Models approve broken code. A test runner doesn't care how convincing the code looks — it either works or it doesn't.

2. One good model beats three mediocre passes. The sequential pipeline used four stages across different models. The Loop uses one model and verifies its output. Less complexity, better results.

3. Cost tracks waste, not quality. Our $5.59 debate runs scored lower than our $0.36 single-model runs. More money bought more prose, not more correctness.

4. The Arbiter still matters. But only after tests pass. The Arbiter catches things tests can't: logic issues, security vulnerabilities, design problems. Reviewing verified code is fundamentally different from reviewing unverified code.

Try it yourself

Run the benchmark with your own API keys and see the numbers firsthand:

pip install crtx

# Run the full benchmark
crtx benchmark --quick

# Or just try the Loop
crtx loop "Build a REST API with FastAPI, SQLite, search and pagination"
View on GitHub · Read the docs
CRTX is built by TriadAI. We killed our own product and rebuilt it when the data said we were wrong. That's the kind of honesty we think this space needs.