Skip to content

Why Crucis

~3 min read

Most AI code generators lack feedback loops. They produce code that looks right but breaks in production. Crucis closes the gap by generating adversarially-hardened test suites before writing a single line of implementation.

The problem

Code generation agents can produce impressive first drafts, but they can't iterate reliably on their own. Without structured feedback, they:

  • Hardcode known inputs — the function returns 3 when it sees (1, 2) rather than actually adding
  • Miss edge cases — works for positive integers, crashes on zero or negatives
  • Satisfy tests without real logic — a lookup table that maps known inputs to expected outputs
  • Stall without direction — when code fails, the agent needs specific, actionable feedback to make progress

The core issue isn't just false-passing code — it's that agents need structured feedback loops to work autonomously. Every human checkpoint ("did the tests catch cheating?", "does this generalize?", "does this meet style requirements?") is a place where an agent stalls or drifts without intervention.

The solution: an autonomy scaffold

Crucis is an autonomy scaffold for code-generating agents. It provides structured automated feedback — tests, adversarial review, constraints, holdout verification — so agents can iterate longer without human intervention. Each verification layer maps to a human checkpoint that Crucis automates.

Test-driven generation is the forcing function: tests are generated and hardened before any implementation exists. The implementation agent only sees the tests — never the objective — so it must write general code to pass them.

Four automated interventions replace human checkpoints:

Automated intervention Replaces this human checkpoint When it runs
Constraint gates "Does this code meet our style and complexity standards?" Every generation attempt
Adversarial review "Are these tests actually robust, or could someone cheat?" After test generation
Cheating probe "Can a fake implementation pass these tests?" After adversarial review
Holdout evals "Does this implementation generalize beyond the examples I showed?" After implementation

What you get

Feature What you get
Adversarial hardening cycles Tests that resist hardcoding and input-memorization cheats
Holdout evals Final safety net — implementation must generalize beyond known examples
34 static constraint checks Enforced complexity limits, security rules, and code quality standards
Checkpoint/resume Stop mid-run and pick up where you left off
--reset / --reset-task Iterate on specific tasks without restarting everything
Auto-holdout Holdout evals are automatically split from your examples — no manual setup needed
Flat constraints List constraints naturally; the system classifies them as required or advisory automatically
Background optimizer (Experimental) Prompt strategies improve automatically across runs
Multi-task objectives Define several related functions in one file, each verified independently

When NOT to use Crucis

  • One-off scripts where correctness doesn't matter much
  • UI/frontend code where behavior is visual, not functional
  • Tasks without clear input/output contracts (e.g., "make the app faster")
  • When you already have comprehensive tests and just need implementation

Crucis works best for functions and modules with well-defined input/output behavior — algorithms, data transformations, business logic, API handlers.