Architecture¶

~4 min read

Crucis is an autonomy scaffold that generates, attacks, and validates code through a pipeline of LLM agents and static analysis, providing structured automated feedback so agents can iterate without human intervention.

System Overview¶

flowchart LR
  objective["objective.yaml"] --> fit[Fit]
  fit --> checkpoint[(Checkpoint)]
  fit --> adversary[Adversary]
  checkpoint --> evaluate[Evaluate]
  evaluate --> sandbox[Sandbox]
  sandbox --> optimizer[Optimizer]

Data Flow¶

Objective intake -- intake/objective.py parses the YAML objective, validates eval expressions, resolves constraint profiles via constraints/loader.py. Constraints can be listed flat — the constraint registry auto-classifies them into required (blocking) vs advisory. The legacy primary:/secondary: format is still supported for backward compatibility.
Fit phase -- core/loop.py iterates through each task:
- Generates pytest train suites via the generation agent
- Validates syntax (AST parse) and constraints (static analysis)
- Presents for user approval (auto-approved by crucis run)
- Runs adversarial review: finds attack vectors, generalization gaps
- Generates and executes a cheating probe against the tests
- Saves progress to checkpoint after each task
Evaluation phase -- core/loop.py orchestrates implementation:
- Builds a curriculum from checkpoint data (core/curriculum.py)
- Sends curriculum + test files to the implementation agent
- Runs pytest in Docker sandbox (or host fallback)
- Verifies against hidden holdout evals
- Retries up to max_iterations on failure
Background optimization (experimental, disabled by default) -- execution/optimizer.py queues a job after fit or evaluate. Enable with optimizer: enabled: true in .crucis/settings.yaml.
- Spawns a detached worker process
- Scores candidate policies against baseline using black-box evaluation
- Promotes winning candidates (manual or auto mode)

Task State Machine¶

Each task in the checkpoint progresses through these states:

stateDiagram-v2
  [*] --> pending
  pending --> train_suite_generated
  train_suite_generated --> train_suite_approved
  train_suite_approved --> adversarially_reviewed
  adversarially_reviewed --> complete
  complete --> [*]

The fit loop resumes from the last incomplete state on restart.

File Structure¶

project/
  objective.yaml           # Objective definition
  plan.md                  # Structured generation plan (from crucis run --plan)
  .checkpoint.json         # Task progress and train suite sources
  curriculum.md            # Generated evaluation guide
  constraints/
    profiles.yaml          # Constraint profile definitions
  crucis/prompts/
    templates/             # Jinja2 templates for all prompt rendering
  tests/
    test_<task>.py         # Generated train suites
  src/
    <target_files>.py      # Implementation targets
  .crucis/
    settings.yaml          # Runtime settings
    optimizer/
      active_policy.yaml   # Current steering policy
      status.json          # Optimizer state
      worker.lock          # Prevents concurrent workers
      queue/               # Pending optimization jobs
      runs/
        <run_id>/
          candidate_policy.yaml
          result.json
          report.md

Module Roles¶

Module	Purpose
`__main__.py`	CLI entry point: init, run, status, validate, doctor, promote, optimizer-worker
`config.py`	Environment-based configuration (API keys, models, limits)
`models.py`	Pydantic models: objectives, checkpoints, constraints, reports
`display.py`	Rich terminal output (tables, panels, syntax highlighting)
`diagnostics.py`	Environment and workspace diagnostics (`crucis doctor`)
`cli/runner.py`	Subprocess wrapper for Claude/Codex agents
`core/loop.py`	Top-level fit and evaluation orchestration
`core/generation.py`	Test generation phase (generate → validate → review cycle)
`core/evaluation.py`	Evaluation phase (implement → verify retry cycle)
`core/verification.py`	Shared verification utilities (test running, constraint checking, holdout evals)
`core/_shared.py`	Shared helpers: preflight checks, test path collection, optimizer enqueueing, run logging
`core/planner.py`	Structured plan generation (`crucis run --plan`)
`prompts/__init__.py`	Jinja2 template engine for all prompt rendering
`prompts/_filters.py`	Custom Jinja2 filters (path_to_module, bool_label, readable_name)
`prompts/templates/`	Jinja2 templates for generation, adversary, probe, evaluation, plan, curriculum, onboarding
`core/prompts.py`	Thin prompt builder wrappers delegating to Jinja2 templates
`core/adversary.py`	Adversarial review, probe generation and execution
`core/curriculum.py`	Markdown curriculum from checkpoint + objective
`core/test_generator.py`	Python extraction from LLM responses
`intake/objective.py`	YAML objective parsing and validation
`intake/scaffold.py`	Workspace scaffolding and agent-driven onboarding (`crucis init`)
`constraints/loader.py`	Profile loading, constraint resolution, and auto-classification via the constraint registry
`constraints/checker.py`	AST-based static analysis (44 checks — each classified as required or advisory)
`constraints/_class_metrics.py`	Class-level metric checkers (methods, fields, lines, WMC per class)
`constraints/_module_metrics.py`	Module-level metric checkers (efferent coupling, maintainability index)
`constraints/_python_idioms.py`	Python idiom checkers (naming conventions, single-char names, unnecessary else, len-as-condition)
`execution/sandbox.py`	Docker-isolated pytest execution
`execution/optimizer.py`	Background policy optimization job management
`persistence/checkpoint.py`	Checkpoint creation, save, and load
`persistence/policy.py`	Policy I/O, candidate management, status tracking
`persistence/settings.py`	Settings schema and persistence
`persistence/events.py`	Structured JSONL event logging
`defaults.py`	Shared utilities: sanitized environment, text excerpting
`_compat.py`	Python 3.10 compatibility shim (StrEnum backport)
`gepa_optimizer.py`	GEPA-powered policy optimization worker

Agent Integration¶

Crucis uses CLI agents as subprocesses, not API calls:

Generation agent (default: claude) -- generates pytest train suites from objectives and constraints. Runs with --output-format json and empty --allowedTools.
Critic agent (default: claude) -- performs adversarial review of generated tests. Same invocation as generation.
Implementation agent (default: codex) -- writes code to pass the tests. Runs with --full-auto and access to Edit, Write, Read, Bash tools.

Agent selection and model are configured via environment variables or config.py:

Setting	Default	Purpose
`generation_agent`	`claude`	Agent for test generation
`generation_model`	`claude-opus-4-6`	Model for test generation
`critic_agent`	`claude`	Agent for adversarial review
`critic_model`	`claude-opus-4-6`	Model for adversarial review
`implementation_agent`	`codex`	Agent for code implementation
`implementation_model`	`gpt-5.3-codex`	Model for code implementation

MCP Server¶

The crucis/mcp/ package exposes every CLI capability as MCP tools for use by AI agents:

Module	Purpose
`mcp/server.py`	FastMCP instance with 19 tools, 7 resources, 5 prompts
`mcp/_workspace.py`	Workspace authorization, path traversal prevention, input validation

The server is a thin integration layer — all business logic stays in the modules listed above. See MCP Server for the full reference.

Verification Granularity¶

Crucis supports two verification modes set in the objective YAML:

task (default) -- each task is verified independently with its own pytest run and holdout check. Best for multi-task objectives where tasks are independent.
objective -- all tasks are verified together in a single pytest run. Holdout failures are reported as counts only (redacted payloads) to prevent information leakage.