Blog

Benchmarks, architecture decisions, and honest post-mortems from the team building CRTX.

How We Killed Our Own Pipeline and Built the Loop

We built a multi-model pipeline, benchmarked it at 88% and $5.59/run, then rebuilt it from scratch. The Loop scores 99% at $1.80/run.

We pointed CRTX at a 265,000-line production codebase. Four models collaborated. Two providers went down. The pipeline completed anyway.

Model-based review is unreliable. A test runner doesn't care how convincing the code looks: the code either passes or it doesn't.

Diagnosis, minimal context, model escalation — how CRTX resolves stubborn test failures without giving up.