CRTX generates code, runs tests, fixes failures, and gets an independent review — before you see it. Every output is verified. Model-agnostic.
Generate, test, fix, review — one command.
AI-generated code often ships with failing tests, broken imports, and missed edge cases. Developers spend 10–30 minutes per generation debugging and fixing AI output before it actually works.
Multi-model debate multiplies cost without meaningfully improving quality. Four models reviewing each other's prose doesn't catch a broken import statement. We benchmarked our own multi-model pipeline and found it scored lower than a single model at 15x the cost.
The issue isn't the model. It's the lack of verification. Nobody runs the code before handing it to you.
Natural language. What you'd tell a senior developer. The CLI accepts your intent and configures the Loop automatically.
crtx loop "Build a user auth service with JWT and SQLite"

CRTX classifies task complexity — simple, medium, complex, or safety — and selects the best single model, fix iteration budget, and timeout tier. One model, best match.
simple → fast model │ medium → balanced │ complex → best + debate

Generate code, run tests automatically (AST parse, imports, pyflakes, pytest, entry point), feed failures back for targeted fixes, and repeat until all tests pass. If the fix cycle stalls, three escalation tiers activate before giving up.
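The complexity routing above can be sketched in Python. This is a hypothetical illustration only: the tier names match the page, but the model labels, fix budgets, and timeouts are invented for the example, not CRTX's actual configuration.

```python
# Illustrative sketch of complexity-based routing. The complexity tiers
# (simple/medium/complex/safety) come from the page; all concrete values
# below are assumptions made up for this example.
from dataclasses import dataclass

@dataclass
class LoopConfig:
    model: str        # which single model handles generation
    fix_budget: int   # fix iterations allowed before escalation
    timeout_s: int    # timeout tier for the whole Loop

ROUTES = {
    "simple":  LoopConfig(model="fast",     fix_budget=2, timeout_s=60),
    "medium":  LoopConfig(model="balanced", fix_budget=4, timeout_s=120),
    "complex": LoopConfig(model="best",     fix_budget=6, timeout_s=300),
    "safety":  LoopConfig(model="best",     fix_budget=6, timeout_s=300),
}

def route(complexity: str) -> LoopConfig:
    # Unknown labels fall back to the most cautious tier.
    return ROUTES.get(complexity, ROUTES["complex"])
```

The point of the table is that one task gets one model, matched to difficulty, rather than every task paying for the most expensive model.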
Generate → Test → Fix → Test → ✓

An independent model — never the same one that generated the code — reviews the final output for logic issues, security gaps, and design problems that automated tests miss. On REJECT, one more fix cycle runs.
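The generate-test-fix cycle can be shown with a minimal sketch. Here an AST parse stands in for the fuller check pipeline (imports, pyflakes, pytest, entry point), and `generate` / `apply_fix` are hypothetical stand-ins for model calls; none of this is CRTX's actual code.

```python
# Minimal sketch of a generate -> test -> fix loop. AST parsing is used as
# a stand-in for the fuller check pipeline described above.
import ast

def run_checks(source: str) -> list[str]:
    failures = []
    try:
        ast.parse(source)  # structural check: does the code parse at all?
    except SyntaxError as exc:
        failures.append(f"syntax: {exc.msg}")
    return failures

def loop(generate, apply_fix, max_fixes: int = 3):
    code = generate()
    for _ in range(max_fixes):
        failures = run_checks(code)
        if not failures:
            return code, True          # all checks pass -> hand off to review
        code = apply_fix(code, failures)  # feed failures back for a targeted fix
    return code, False                 # budget exhausted -> escalation tiers

# Demo with toy stand-ins: the first draft has a syntax error, the "fix" repairs it.
draft = "def add(a, b) return a + b"
repaired = "def add(a, b):\n    return a + b"
code, ok = loop(lambda: draft, lambda c, failures: repaired)
```

The key design point is that failures are fed back as structured input to the next attempt, rather than regenerating from scratch.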
APPROVE │ FLAG │ REJECT │ HALT

After the Loop's test-fix cycle converges, an independent Arbiter reviews the final output. It always uses a different model than the generator — cross-model review catches logic errors, security gaps, and design problems that automated tests can't detect.
On REJECT, the Loop runs one more targeted fix cycle and retests. No model grades its own work.
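The cross-model rule can be expressed in a few lines. A hedged sketch: the verdict strings come from the page above, but the model names and the selection logic are illustrative assumptions.

```python
# Sketch of cross-model review: the Arbiter must differ from the generator,
# and the verdict must be one of the four listed above. Model names are
# illustrative, not CRTX's actual roster.
VERDICTS = {"APPROVE", "FLAG", "REJECT", "HALT"}

def pick_arbiter(generator_model: str, available: list[str]) -> str:
    # Never let the generator grade its own work.
    candidates = [m for m in available if m != generator_model]
    if not candidates:
        raise ValueError("need at least one model besides the generator")
    return candidates[0]

def review(generator_model: str, available: list[str], review_fn) -> tuple[str, str]:
    arbiter = pick_arbiter(generator_model, available)
    verdict = review_fn(arbiter)
    assert verdict in VERDICTS, f"unknown verdict: {verdict}"
    return arbiter, verdict

# Toy usage: a stubbed reviewer that always approves.
arbiter, verdict = review("sonnet", ["sonnet", "o3"], lambda m: "APPROVE")
```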
Same 12 prompts, same scoring rubric. CRTX Loop vs. single models vs. multi-model debate.
Dev Time = estimated developer minutes to make the output production-ready (fixing test failures, import errors, and entry point issues). Consistency = max score minus min score across all prompts; lower means more uniform quality. Run crtx benchmark --quick to reproduce these results with your own API keys.
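The Consistency metric is simple spread arithmetic, worth making explicit:

```python
# Consistency as defined above: max score minus min score across all prompts.
# A lower value means the pipeline's quality varies less from prompt to prompt.
def consistency(scores: list[float]) -> float:
    return max(scores) - min(scores)

# Toy example with made-up per-prompt scores:
spread = consistency([7.0, 9.0, 8.5, 8.0])
```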
When the normal fix cycle can't resolve a test failure, CRTX escalates.
Tier 1: The model explains the root cause without writing code. The diagnosis then feeds back into a targeted fix attempt; separating analysis from implementation avoids repeating the same mistake.

Tier 2: CRTX strips context down to only the failing test file and the single source file it imports. Nothing else. A fresh perspective with less noise often resolves what full context couldn't.

Tier 3: CRTX escalates to a completely different model (prefers o3 if the primary was Sonnet, Sonnet if the primary was o3) and includes the primary model's diagnosis: "they diagnosed this but couldn't fix it — what do you see?"
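The three tiers compose into a pipeline, sketched below. Function names, the cross-model swap table, and the prompt wording are illustrative assumptions, not CRTX internals; only the tier order and the Sonnet/o3 preference come from the text above.

```python
# Illustrative pipeline for the three escalation tiers described above.
def diagnose(model: str, failure: str) -> str:
    # Tier 1: ask the model to explain the root cause, no code allowed.
    return f"[{model}] root cause of {failure!r}"

def minimize_context(files: dict[str, str], failing_test: str,
                     imported_source: str) -> dict[str, str]:
    # Tier 2: keep only the failing test file and the one source file it imports.
    return {name: files[name] for name in (failing_test, imported_source)}

# Tier 3: illustrative cross-model preference (different family from the primary).
SWAP = {"sonnet": "o3", "o3": "sonnet"}

def escalate(primary: str, failing_test: str, imported_source: str,
             files: dict[str, str]):
    diagnosis = diagnose(primary, failing_test)                       # Tier 1
    context = minimize_context(files, failing_test, imported_source)  # Tier 2
    fresh_model = SWAP.get(primary, "o3")                             # Tier 3
    prompt = (f"{primary} diagnosed this but couldn't fix it: "
              f"{diagnosis} -- what do you see?")
    return fresh_model, context, prompt

# Toy usage with a three-file project where only two files are relevant.
files = {"test_auth.py": "...", "auth.py": "...", "unrelated.py": "..."}
fresh, ctx, prompt = escalate("sonnet", "test_auth.py", "auth.py", files)
```

Each tier changes one variable at a time (what the model is asked, what it sees, who it is), which keeps failed escalations diagnosable.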
Bring your own API keys. CRTX handles the rest.
pip install crtx
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
crtx loop "Build a user auth service with JWT tokens and SQLite"
"We benchmarked our own multi-model pipeline against single models and found it was making things worse. So we rebuilt from scratch. The Loop generates, tests, fixes, and reviews — every output is verified before you see it."
Free CLI forever. Pro dashboard coming soon.
Both tiers use your own API keys — you pay model providers directly.
Average Loop cost: $0.33–1.00 depending on task complexity.
Every output tested, fixed, and reviewed before you see it.
pip install crtx

View on GitHub →