# Architecture
Crucis is an autonomy scaffold that generates, attacks, and validates code through a pipeline of LLM agents and static analysis, providing structured automated feedback so agents can iterate without human intervention.
## System Overview
```mermaid
flowchart LR
    objective["objective.yaml"] --> fit[Fit]
    fit --> checkpoint[(Checkpoint)]
    fit --> adversary[Adversary]
    checkpoint --> evaluate[Evaluate]
    evaluate --> sandbox[Sandbox]
    sandbox --> optimizer[Optimizer]
```
## Data Flow
- **Objective intake** -- `intake/objective.py` parses the YAML objective, validates eval expressions, and resolves constraint profiles via `constraints/loader.py`. Constraints can be listed flat; the constraint registry auto-classifies them into required (blocking) vs. advisory. The legacy `primary:`/`secondary:` format is still supported for backward compatibility.
- **Fit phase** -- `core/loop.py` iterates through each task:
    - Generates pytest train suites via the generation agent
    - Validates syntax (AST parse) and constraints (static analysis)
    - Presents for user approval (auto-approved by `crucis run`)
    - Runs adversarial review: finds attack vectors and generalization gaps
    - Generates and executes a cheating probe against the tests
    - Saves progress to the checkpoint after each task
- **Evaluation phase** -- `core/loop.py` orchestrates implementation:
    - Builds a curriculum from checkpoint data (`core/curriculum.py`)
    - Sends the curriculum and test files to the implementation agent
    - Runs pytest in the Docker sandbox (or host fallback)
    - Verifies against hidden holdout evals
    - Retries up to `max_iterations` on failure
- **Background optimization** (experimental, disabled by default) -- `execution/optimizer.py` queues a job after fit or evaluate. Enable with `optimizer: enabled: true` in `.crucis/settings.yaml`.
    - Spawns a detached worker process
    - Scores candidate policies against the baseline using black-box evaluation
    - Promotes winning candidates (manual or auto mode)
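Written out with its nesting, the optimizer toggle above corresponds to this fragment of `.crucis/settings.yaml`:

```yaml
optimizer:
  enabled: true
```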
## Task State Machine
Each task in the checkpoint progresses through these states:
```mermaid
stateDiagram-v2
    [*] --> pending
    pending --> train_suite_generated
    train_suite_generated --> train_suite_approved
    train_suite_approved --> adversarially_reviewed
    adversarially_reviewed --> complete
    complete --> [*]
```
The fit loop resumes from the last incomplete state on restart.
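The linear progression can be sketched in Python. This is an illustrative model only; the enum name, ordering helper, and `next_state` function are assumptions, not Crucis's actual `models.py` API:

```python
from enum import Enum


class TaskState(str, Enum):
    """Task states from the checkpoint state machine (values match the diagram)."""
    PENDING = "pending"
    TRAIN_SUITE_GENERATED = "train_suite_generated"
    TRAIN_SUITE_APPROVED = "train_suite_approved"
    ADVERSARIALLY_REVIEWED = "adversarially_reviewed"
    COMPLETE = "complete"


# Definition order doubles as progression order.
_ORDER = list(TaskState)


def next_state(state: TaskState) -> TaskState:
    """Advance one step; COMPLETE is terminal, so resuming a finished task is a no-op."""
    i = _ORDER.index(state)
    return _ORDER[min(i + 1, len(_ORDER) - 1)]
```

Because each state is strictly later than the previous one, resuming is just "find the first task whose state is not `complete` and continue from its current state."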
## File Structure
```
project/
    objective.yaml            # Objective definition
    plan.md                   # Structured generation plan (from crucis run --plan)
    .checkpoint.json          # Task progress and train suite sources
    curriculum.md             # Generated evaluation guide
    constraints/
        profiles.yaml         # Constraint profile definitions
    crucis/prompts/
        templates/            # Jinja2 templates for all prompt rendering
    tests/
        test_<task>.py        # Generated train suites
    src/
        <target_files>.py     # Implementation targets
    .crucis/
        settings.yaml         # Runtime settings
        optimizer/
            active_policy.yaml    # Current steering policy
            status.json           # Optimizer state
            worker.lock           # Prevents concurrent workers
            queue/                # Pending optimization jobs
            runs/
                <run_id>/
                    candidate_policy.yaml
                    result.json
                    report.md
```
## Module Roles
| Module | Purpose |
|---|---|
| `__main__.py` | CLI entry point: `init`, `run`, `status`, `validate`, `doctor`, `promote`, `optimizer-worker` |
| `config.py` | Environment-based configuration (API keys, models, limits) |
| `models.py` | Pydantic models: objectives, checkpoints, constraints, reports |
| `display.py` | Rich terminal output (tables, panels, syntax highlighting) |
| `diagnostics.py` | Environment and workspace diagnostics (`crucis doctor`) |
| `cli/runner.py` | Subprocess wrapper for Claude/Codex agents |
| `core/loop.py` | Top-level fit and evaluation orchestration |
| `core/generation.py` | Test generation phase (generate → validate → review cycle) |
| `core/evaluation.py` | Evaluation phase (implement → verify retry cycle) |
| `core/verification.py` | Shared verification utilities (test running, constraint checking, holdout evals) |
| `core/_shared.py` | Shared helpers: preflight checks, test path collection, optimizer enqueueing, run logging |
| `core/planner.py` | Structured plan generation (`crucis run --plan`) |
| `prompts/__init__.py` | Jinja2 template engine for all prompt rendering |
| `prompts/_filters.py` | Custom Jinja2 filters (`path_to_module`, `bool_label`, `readable_name`) |
| `prompts/templates/` | Jinja2 templates for generation, adversary, probe, evaluation, plan, curriculum, onboarding |
| `core/prompts.py` | Thin prompt-builder wrappers delegating to Jinja2 templates |
| `core/adversary.py` | Adversarial review, probe generation and execution |
| `core/curriculum.py` | Markdown curriculum from checkpoint + objective |
| `core/test_generator.py` | Python extraction from LLM responses |
| `intake/objective.py` | YAML objective parsing and validation |
| `intake/scaffold.py` | Workspace scaffolding and agent-driven onboarding (`crucis init`) |
| `constraints/loader.py` | Profile loading, constraint resolution, and auto-classification via the constraint registry |
| `constraints/checker.py` | AST-based static analysis (44 checks, each classified as required or advisory) |
| `constraints/_class_metrics.py` | Class-level metric checkers (methods, fields, lines, WMC per class) |
| `constraints/_module_metrics.py` | Module-level metric checkers (efferent coupling, maintainability index) |
| `constraints/_python_idioms.py` | Python idiom checkers (naming conventions, single-char names, unnecessary `else`, `len`-as-condition) |
| `execution/sandbox.py` | Docker-isolated pytest execution |
| `execution/optimizer.py` | Background policy-optimization job management |
| `persistence/checkpoint.py` | Checkpoint creation, save, and load |
| `persistence/policy.py` | Policy I/O, candidate management, status tracking |
| `persistence/settings.py` | Settings schema and persistence |
| `persistence/events.py` | Structured JSONL event logging |
| `defaults.py` | Shared utilities: sanitized environment, text excerpting |
| `_compat.py` | Python 3.10 compatibility shim (`StrEnum` backport) |
| `gepa_optimizer.py` | GEPA-powered policy optimization worker |
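To make the idiom checkers concrete, here is a hedged sketch of the kind of AST analysis `constraints/_python_idioms.py` performs. The function below is hypothetical; the real checker's API, configuration, and exemptions (loop counters, for example) are not shown:

```python
import ast


def single_char_assignments(source: str) -> list[int]:
    """Return line numbers where a single-character name is assigned.

    Illustrative advisory-style check, not the real Crucis checker.
    """
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        # ast.Store distinguishes assignment targets from mere reads.
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Store)
        and len(node.id) == 1
    ]
```

Because the analysis walks the parsed AST rather than matching text, it never misfires on string literals or comments that happen to contain short names.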
## Agent Integration
Crucis invokes CLI agents as subprocesses rather than calling model APIs directly:
- **Generation agent** (default: `claude`) -- generates pytest train suites from objectives and constraints. Runs with `--output-format json` and an empty `--allowedTools` list.
- **Critic agent** (default: `claude`) -- performs adversarial review of generated tests. Same invocation as generation.
- **Implementation agent** (default: `codex`) -- writes code to pass the tests. Runs with `--full-auto` and access to the Edit, Write, Read, and Bash tools.
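A minimal sketch of the subprocess pattern. The generic `run_agent` helper and the way the prompt is passed are assumptions about `cli/runner.py`; only the flags come from the list above:

```python
import json
import subprocess


def run_agent(argv: list[str]) -> dict:
    """Run a CLI agent to completion and parse its JSON stdout."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Generation-agent invocation with the flags described above; how the
# rendered prompt reaches the agent is an assumption, not documented behavior.
generation_cmd = [
    "claude", "-p", "<rendered prompt>",
    "--output-format", "json",
    "--allowedTools", "",
]
```

Running agents as subprocesses keeps the scaffold agnostic to vendor SDKs: swapping an agent means changing an executable name and flags, not rewriting API client code.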
Agent selection and model are configured via environment variables or `config.py`:
| Setting | Default | Purpose |
|---|---|---|
| `generation_agent` | `claude` | Agent for test generation |
| `generation_model` | `claude-opus-4-6` | Model for test generation |
| `critic_agent` | `claude` | Agent for adversarial review |
| `critic_model` | `claude-opus-4-6` | Model for adversarial review |
| `implementation_agent` | `codex` | Agent for code implementation |
| `implementation_model` | `gpt-5.3-codex` | Model for code implementation |
## MCP Server
The `crucis/mcp/` package exposes every CLI capability as MCP tools for use by AI agents:
| Module | Purpose |
|---|---|
| `mcp/server.py` | FastMCP instance with 19 tools, 7 resources, 5 prompts |
| `mcp/_workspace.py` | Workspace authorization, path traversal prevention, input validation |
The server is a thin integration layer -- all business logic stays in the modules listed above. See MCP Server for the full reference.
## Verification Granularity
Crucis supports two verification modes set in the objective YAML:

- `task` (default) -- each task is verified independently with its own pytest run and holdout check. Best for multi-task objectives where tasks are independent.
- `objective` -- all tasks are verified together in a single pytest run. Holdout failures are reported as counts only (redacted payloads) to prevent information leakage.
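In objective-YAML terms this might look as follows. The `verification` key name is a guess; the docs above only state that the mode is set in the objective file, and only the two mode values are documented:

```yaml
# Hypothetical key name -- the values "task" and "objective" come from
# the docs, the exact YAML key does not.
verification: task   # or: objective
```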