Open source — Apache 2.0

AI code generation
that tests and fixes
its own output

CRTX generates code, runs tests, fixes failures, and gets an independent review — before you see it. Every output is verified. Model-agnostic.

Start building → View documentation
Route │ medium → Sonnet 4
Generate │ 4 files, 312 lines
Test → Fix │ 29/29 passing
Arbiter │ APPROVE (0.91)
99% avg benchmark score
2 min avg dev time
1,096 tests passing
5+ LLM providers
$0.33–1.00 per run

See it run

Generate, test, fix, review — one command.

crtx — ~/project

The problem

Single models generate code that looks correct

But it often ships with failing tests, broken imports, and missed edge cases. Developers spend 10–30 minutes per generation debugging and fixing AI output before it actually works.

Multi-model pipelines cost 10–15x more

Without meaningfully improving quality. Four models reviewing each other's prose doesn't catch a broken import statement. We benchmarked our own multi-model pipeline and found it scored lower than a single model at 15x the cost.

The issue isn't the model. It's the lack of verification. Nobody runs the code before handing it to you.

How it works

01

You describe the task

Natural language. What you'd tell a senior developer. The CLI accepts your intent and configures the Loop automatically.

crtx loop "Build a user auth service with JWT and SQLite"
02

Smart routing picks the best model

CRTX classifies task complexity — simple, medium, complex, or safety — and selects the best single model, fix iteration budget, and timeout tier. One model, best match.

simple → fast model │ medium → balanced │ complex → best + debate
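In outline, the routing step can be sketched as a small lookup. This is a hypothetical sketch: the tier contents, budgets, and timeouts below are illustrative, not CRTX's actual configuration.

```python
# Hypothetical sketch of complexity-based routing. The tier values
# (models, budgets, timeouts) are illustrative, not CRTX's real config.
ROUTES = {
    "simple":  {"model": "fast",     "fix_budget": 2, "timeout_s": 60},
    "medium":  {"model": "balanced", "fix_budget": 4, "timeout_s": 180},
    "complex": {"model": "best",     "fix_budget": 6, "timeout_s": 600},
    "safety":  {"model": "best",     "fix_budget": 6, "timeout_s": 600},
}

def route(complexity: str) -> dict:
    """Pick one model plus a fix-iteration budget and timeout tier."""
    return ROUTES[complexity]
```

One classification, one model, one budget: the point is that routing is cheap and deterministic once the complexity label is known.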
03

The Loop generates, tests, and fixes

Generate code, run tests automatically (AST parse, imports, pyflakes, pytest, entry point), feed failures back for targeted fixes, repeat until all tests pass. If the fix cycle stalls, three escalation tiers activate before giving up.

Generate → Test → Fix → Test → ✓
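The cycle above can be sketched as a small loop. This is a hypothetical outline, not CRTX's internals: `generate`, `run_checks`, and `fix` stand in for the real pipeline stages.

```python
# Illustrative sketch of a generate -> test -> fix loop.
# generate, run_checks, and fix are stand-ins for CRTX's real stages;
# run_checks would cover AST parse, imports, pyflakes, pytest, entry point.

def loop(task, generate, run_checks, fix, max_iters=4):
    code = generate(task)
    for _ in range(max_iters):
        failures = run_checks(code)   # run the automated test battery
        if not failures:
            return code, True         # all checks pass: verified output
        code = fix(code, failures)    # feed failures back for a targeted fix
    return code, False                # stalled: escalation tiers take over
```

The key design choice is that failures are fed back verbatim, so each fix attempt targets a concrete error rather than regenerating from scratch.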
04

The Arbiter catches what tests can't

An independent model — never the same one that generated the code — reviews the final output for logic issues, security gaps, and design problems that automated tests miss. On REJECT, one more fix cycle runs.

APPROVE │ FLAG │ REJECT │ HALT
THE ARBITER

The final quality gate

After the Loop's test-fix cycle converges, an independent Arbiter reviews the final output. It always uses a different model than the generator — cross-model review catches logic errors, security gaps, and design problems that automated tests can't detect.

On REJECT, the Loop runs one more targeted fix cycle and retests. No model grades its own work.

APPROVE
Code passes review. Output is verified and presented.
FLAG
Minor concerns noted but acceptable to ship.
REJECT
Issues found. One targeted fix cycle runs, then retest.
HALT
Critical problem detected. Loop stops immediately.
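A minimal sketch of the gate, assuming invented helper names (`pick_arbiter` and `handle_verdict` are not CRTX's API):

```python
# Hypothetical sketch of the Arbiter gate -- helper names are invented.

def pick_arbiter(generator_model, available):
    """Cross-model review: never let a model grade its own work."""
    return next(m for m in available if m != generator_model)

def handle_verdict(verdict):
    """Map each Arbiter verdict to the action the text describes."""
    return {
        "APPROVE": "present verified output",
        "FLAG":    "present output with noted concerns",
        "REJECT":  "run one targeted fix cycle, then retest",
        "HALT":    "stop the loop immediately",
    }[verdict]
```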

Benchmark results

Same 12 prompts, same scoring rubric. CRTX Loop vs. single models vs. multi-model debate.

Condition            Avg Score   Min    Consistency   Dev Time   Cost
Single Sonnet        94%         92%    ±4 pts        10 min     $0.36
Single o3            81%         54%    ±41 pts       4 min      $0.44
Multi-model Debate   88%         75%    ±25 pts       9 min      $5.59
CRTX Loop            99%         98%    ±2 pts        2 min      $1.80

Dev Time = estimated developer minutes to get the output to production (fixing test failures, import errors, and entry point issues). Consistency = max score minus min score across all prompts. Run crtx benchmark --quick to reproduce these results with your own API keys.

Three-tier gap closing

When the normal fix cycle can't resolve a test failure, CRTX escalates.

TIER 1~$0.08

Diagnose then fix

The model explains the root cause without writing code. Then the diagnosis feeds back into a targeted fix attempt. Separating analysis from implementation avoids repeating the same mistake.

TIER 2~$0.05

Minimal context

Strips context down to only the failing test file and the single source file it imports. Nothing else. A fresh perspective with less noise often resolves what full context couldn't.

TIER 3~$0.08

Second opinion

Escalates to a completely different model (prefers o3 if primary was Sonnet, Sonnet if primary was o3). Includes the primary model's diagnosis: "they diagnosed this but couldn't fix it — what do you see?"

No stubborn test gets shipped to you unfixed.
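The ladder reads like this in outline. A hypothetical sketch only: the helper names are invented, and the alternate model follows the text's o3/Sonnet example.

```python
# Illustrative sketch of the three escalation tiers -- helpers are invented.

def close_gap(failure, attempt_fix, diagnose, strip_context, other_model):
    # Tier 1: explain the root cause first, then fix with that diagnosis.
    diagnosis = diagnose(failure)
    fix = attempt_fix(failure, hint=diagnosis)
    if fix:
        return fix
    # Tier 2: retry with only the failing test and the file it imports.
    fix = attempt_fix(strip_context(failure))
    if fix:
        return fix
    # Tier 3: hand the diagnosis to a different model for a second opinion.
    return attempt_fix(failure, hint=diagnosis, model=other_model)
```

Each tier changes exactly one variable (analysis mode, context size, model) so a failed attempt still tells the next tier something useful.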

Get started in 60 seconds

Bring your own API keys. CRTX handles the rest.

quickstart.sh
pip install crtx
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

crtx loop "Build a user auth service with JWT tokens and SQLite"

"We benchmarked our own multi-model pipeline against single models and found it was making things worse. So we rebuilt from scratch. The Loop generates, tests, fixes, and reviews — every output is verified before you see it."

CRTX Team

Simple pricing

Free CLI forever. Pro dashboard coming soon.

OPEN SOURCE
Free
Apache 2.0 — forever
Bring your own API keys
Full Loop: generate → test → fix → review
Smart routing (picks best model per task)
Independent Arbiter review
Three-tier gap closing
Built-in benchmark tool
5+ LLM providers (Claude, GPT, Gemini, Grok, DeepSeek)
REPL mode for interactive sessions
Apply mode — write code to your repo
Context injection from your codebase
Auto-fallback on provider outages
Unlimited local runs
pip install crtx
COMING SOON
PRO
Pricing TBD
Web dashboard + team features
Everything in Open Source
Web dashboard with real-time pipeline view
Session history and replay
Usage analytics and cost tracking
Team collaboration
Sign up for early access →

Both tiers use your own API keys — you pay model providers directly.
Average Loop cost: $0.33–1.00 depending on task complexity.

Stop debugging
AI-generated code

Every output tested, fixed, and reviewed before you see it.

pip install crtx │ View on GitHub →