Operator Runbook
Audience: operators running Animus in production.
This runbook covers healthchecks, common failure scenarios with remediations, debug procedures, backup/restore, upgrades, and disaster recovery. Every command in this guide is copy-pasteable against a current v0.4.x install.
For the upgrade procedure itself, see docs/migration/v0.4.11-to-v0.4.12.md and the cross-version docs/guides/upgrading.md.
1. Healthcheck procedures
animus daemon health
Fast point-in-time check. Spawns every discovered plugin one-shot, runs the handshake, calls health/check, and exits with non-zero on any Unhealthy row. Wall time is roughly one probe regardless of plugin count (probes fan out concurrently).
animus daemon health
animus daemon health --json # machine-readableThe daemon's own status stays Healthy even when individual plugins fail — plugin trouble is an observability concern, not a daemon liveness one.
animus daemon metrics
Counters, gauges, and histograms emitted by the daemon and the plugin host.
animus daemon metrics --pretty
animus daemon metrics --watch --interval-secs 5Useful counters to know:
| Metric | Meaning |
|---|---|
plugin_disabled_total{plugin=<name>} | Plugin exhausted its supervisor restart budget |
subscription_events_total{kind=<kind>} | Events fanned out to control-socket subscribers |
workflow_runs_total{status=<status>} | Workflow completion counts by status |
agent_runs_total{provider=<name>} | agent/run calls per provider |
animus daemon preflight
Required-plugins matrix. Exits non-zero if any required role is unsatisfied. The error message includes the exact animus plugin install ... command to fix each gap.
animus daemon preflight
animus daemon preflight --json # schema: animus.daemon.preflight.v1
animus daemon preflight --auto-install # fix gaps in placeExpected output when satisfied:
Daemon preflight OK
provider: 5 installed (claude, codex, gemini, opencode, oai)
subject: 5 installed (default, requirements, linear, sqlite, markdown)
transport+ui: 3 installed (transport-http, transport-graphql, web-ui)The same preflight runs at animus daemon start. Skip it for dev iteration with --skip-preflight; not recommended in production.
2. Common failure scenarios
Daemon won't start
error: daemon preflight failed
provider: 0 installed (need at least one)
fix: animus plugin install-defaultsThe daemon stops shipping with bundled providers, subjects, or transports as of v0.4.12. The startup preflight refuses to boot without at least one provider and one subject backend installed.
Fix:
animus plugin install-defaults --include-subjects --include-transports
animus daemon start --autonomousOr in one shot:
animus daemon start --autonomous --auto-installPlugin disabled by supervisor
The trigger and (in v0.5) provider supervisors give each plugin 5 restart attempts under exponential backoff. After the budget is exhausted, the plugin is disabled until the next daemon restart.
Diagnose:
animus daemon metrics --pretty | grep plugin_disabled_total
animus daemon health
animus plugin info --name <plugin-name> # now also fails early on missing required env
animus daemon logs --limit 200 # look for restart-loop messagesFix path:
- Read the plugin's
last_errorfromanimus daemon healthto identify the root cause (missing env var, upstream rate limit, crash bug). - Fix the underlying issue (set the env, wait out the rate limit, pin to a known-good plugin version).
- Restart the daemon to clear the disable flag:
animus daemon stop
animus daemon start --autonomousWorkflow blocked
animus workflow get <workflow-id>Look for blocked_reason. Common values:
"Blocked by status update"— the workflow ended inFailedand the task projector blocked it. Inspect the failing phase's output, fix the underlying issue, then move the task back toready:
animus subject status --kind task --id task:TASK-001 --status ready"Idempotency: phase requires explicit retry decision"— a phase without anidempotency: idempotentannotation crashed mid-execution. The daemon refuses to silently re-run side-effecting phases. Inspect, then if you've decided it's safe:
animus workflow resume <workflow-id> --force"Subject backend unavailable"— the subject plugin that owns this kind crashed or was disabled. Runanimus daemon healthand follow the plugin-disabled fix path above.
Tokens not streaming to UI
The web UI receives streaming events over the transport plugins. If events stop flowing:
- Verify transports are installed:
animus plugin list | grep -E 'transport|web-ui'If missing:
animus plugin install-defaults --include-transports- Verify subscribers are receiving:
animus daemon metrics --pretty | grep subscription_events_totalA flat counter means the daemon isn't producing events; an incrementing counter that the UI isn't rendering points at the transport plugin or the browser side.
- Known gap (v0.4.x → v0.5): workflow events emitted from subprocess-spawned workflow runners are not yet plumbed through the long-lived plugin host. Single-process workflows stream fine; the subprocess path is tracked for v0.5. See
docs/architecture/plugin-host-concurrency.md.
Plugin install fails signature
error: plugin signature verification failed
plugin: animus-subject-foo
policy: strict
reason: signature not trusted by built-in keysThe current default policy is warn, so installs record signature status without failing closed by default. Verification is keyless — trust is anchored on Sigstore Fulcio + Rekor (built into the cosign binary) plus the per-publisher identity regex; there is no PEM trust anchor to manage.
To temporarily skip verification for an unsigned local build:
animus plugin install <owner/repo> --signature-policy disabledTo verify a published artifact manually:
cosign verify-blob \
--certificate-identity-regexp "https://github.com/launchapp-dev/<repo>/.github/workflows/release\.yml@refs/tags/v.*" \
--certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
--bundle <asset>.tar.gz.bundle \
<asset>.tar.gzSee docs/reference/security.md and docs/architecture/plugin-signing.md for the full policy model. The pre-v0.4.12 --trust-key <PATH> flag is deprecated and a no-op.
3. Debug procedures
Enable JSON logs
ANIMUS_LOG_FORMAT=json animus daemon start --autonomousStructured logs ship to logs/events.jsonl under the scoped state root. Pipe through jq for filtering:
tail -F ~/.animus/<scope>/logs/events.jsonl | jq -c 'select(.level == "error")'Tail daemon logs
Use the CLI to read the active log storage backend. When the daemon is down, the command reads the local logs/events.jsonl mirror directly.
animus logs tail --level info --limit 100Startup and shutdown emit structured JSON lines (daemon_startup, daemon_shutdown), and workflow events include phase and run metadata.
Stream live events
animus daemon stream --pretty
animus daemon stream --workflow <wf-id> # filter by workflow
animus daemon stream --cat phase --level info # filter by categoryStreams structured log events across daemon, workflows, and runs. Useful for watching a single workflow in real time.
Check plugin logs
Each plugin spawns as a child process and inherits the daemon's stderr by default. Daemon-managed provider, subject, trigger, and log-storage plugins also run with cwd pinned to the resolved project_root, so repo-local state under .animus/ does not depend on which shell launched the daemon. To isolate a specific plugin:
animus plugin info --name <plugin-name> # locate the binary
animus plugin ping --name <plugin-name> # liveness + handshake; fails if env_required is missing
animus plugin call --name <plugin-name> \ # send raw JSON-RPC
--method health/check --params '{}'Long-running plugins (subjects, triggers, transports, and v0.5+ providers) route logs through the active log_storage_backend when one is installed. Daemon events still remain mirrored in ~/.animus/<scope>/logs/events.jsonl for local polling and fallback reads.
4. Backup and restore
Files to back up
| Path | Contains |
|---|---|
<project>/.animus/ | Project-local config + workflow YAML overlays |
~/.animus/<scope>/workflow.db | Persisted workflows, checkpoints, and subject-backed runtime records |
~/.animus/<scope>/logs/ | Runtime event logs and run logs |
~/.animus/<scope>/state/ | Compiled runtime config and supporting runtime state |
~/.animus/plugins/ | Installed plugin binaries + plugins.yaml |
~/.animus/trusted-signers.yaml | Cosign trust allowlist (if you customized it) |
<scope> is the repo-scope hash derived from the project's git common root. See protocol::scoped_state_root for the exact derivation.
Snapshot script
#!/bin/bash
SCOPE=$(animus doctor --json | jq -r '.repo_scope')
DEST=/backups/animus-$(date +%Y%m%d-%H%M%S)
mkdir -p "$DEST"
animus daemon stop # quiesce writers
tar czf "$DEST/project-animus.tar.gz" -C /path/to/project .animus
tar czf "$DEST/scoped-state.tar.gz" -C ~/.animus "$SCOPE"
tar czf "$DEST/plugins.tar.gz" -C ~/.animus plugins plugins.yaml trusted-signers.yaml 2>/dev/null
animus daemon start --autonomousRestore
animus daemon stop
tar xzf project-animus.tar.gz -C /path/to/project
tar xzf scoped-state.tar.gz -C ~/.animus
tar xzf plugins.tar.gz -C ~/.animus
animus daemon start --autonomous --auto-install # re-installs plugins if any missing
animus daemon preflight # verify
animus daemon health # verify5. Upgrade procedure
Cross-version steps live in docs/guides/upgrading.md.
For v0.4.11 → v0.4.12 specifically, follow docs/migration/v0.4.11-to-v0.4.12.md. The TL;DR:
animus daemon stop
curl -fsSL https://raw.githubusercontent.com/launchapp-dev/animus-cli/main/scripts/install.sh | bash
animus plugin install-defaults --include-subjects --include-transports
animus daemon preflight
animus daemon start --autonomousFor routine plugin updates:
animus plugin update # all installed plugins
animus plugin update <plugin-name> # specific plugin
animus plugin update <plugin-name> --tag v0.2.0 # pin to a tag
animus plugin update --dry-run # preview6. Disaster recovery
Crashed mid-workflow
The daemon writes <phase>.session.json for every in-flight phase ({ provider, session_id, status }). On restart the scheduler attempts provider.resume_agent through the original plugin.
Behavior by idempotency annotation:
| Annotation | Crash-recovery behavior |
|---|---|
idempotent | Auto-retried on daemon restart |
sideeffecting | Blocked on crash; operator decides |
unknown (default for unannotated phases) | Blocked on crash; operator decides |
Decision tree after a crash:
animus workflow list --status blocked
animus workflow get <id> # read blocked_reason
animus subject get --kind task --id <task-id> # context
# If you've decided it's safe to resume:
animus workflow resume <id> --force
# If it should be re-run from scratch:
animus workflow cancel <id>
animus subject status --kind task --id <task-id> --status ready
# If the provider plugin doesn't support resume:
# Install the matching plugin or change the agent's provider in workflow YAML
animus plugin install launchapp-dev/animus-provider-claudeDaemon corrupt state
If ~/.animus/<scope>/state/ is corrupt (rare):
animus daemon stop
mv ~/.animus/<scope>/state ~/.animus/<scope>/state.broken-$(date +%s)
animus daemon start --autonomous # daemon rebuilds compiled config
animus daemon preflight # verifyWorkflow YAML overlays in .animus/workflows*.yaml are preserved because they are project-local authored config. Subject and workflow runtime state now lives under ~/.animus/<scope>/, so restore the scoped runtime snapshot if you need the operational state back as well.
Provider plugin returns garbage
ANIMUS_PROVIDER_DISABLE_PLUGIN was removed in v0.4.12 — there is no in-tree fallback to switch to, so the kill-switch had nothing left to do. To quarantine a specific provider plugin, move it out of discovery or uninstall it:
animus daemon stop
animus plugin uninstall --name <plugin-name>
animus daemon start --autonomousAny phase that targets the removed provider will Block with a hard error pointing at the install/uninstall surface, instead of silently routing to a missing in-tree implementation.
To disable subject discovery entirely:
ANIMUS_DAEMON_DISABLE_SUBJECT_PLUGINS=1 animus daemon start --autonomousTo skip the trigger supervisor:
ANIMUS_DAEMON_DISABLE_TRIGGERS=1 animus daemon start --autonomousAll three kill switches require a daemon restart to take effect and to re-enable. See docs/reference/configuration.md#plugin-kill-switches.
7. Quick smoke test
After any non-trivial change (upgrade, plugin install, env update):
animus --version
animus plugin list
animus daemon preflight
animus daemon start --autonomous
animus daemon health
animus subject list --kind task
animus workflow list --limit 5If any step fails, animus doctor runs a broader environment check and prints actionable remediation.
Related docs
- Daemon Operations — start/stop/pause/configure
- Troubleshooting — issue catalog with fixes
- Upgrading — cross-version upgrade procedure
- v0.4.11 → v0.4.12 migration — the plugin-extraction cut
- Plugin Author Guide — what plugins look like from the inside
- Architecture Diagram — runtime topology