Machine-readable runbooks for known failure modes, stored as YAML in
recovery/playbooks/. Each playbook maps symptom → detection → root cause → fix → verify, so a less experienced agent (or a junior operator) can read it and execute a fix without needing the original debugger's context.
Per Golden Rule 11, every bootstrap / release fix MUST include a playbook
entry. Knowledge that lives only in someone's head is lost the next time
the failure occurs.
recovery/
└── playbooks/
    ├── sso-oauth-redirect-mismatch.yaml
    ├── sso-saml-metadata-stale.yaml
    ├── release-verify-race.yaml
    ├── release-run-exit-5-with-drift.yaml
    ├── release-orchestrator-source-drift.yaml
    ├── containers-stuck-restarting.yaml
    ├── caddy-routing-404.yaml
    ├── database-migration-locked.yaml
    └── ...
Files are organised by category (`sso-*`, `release-*`, `containers-*`,
`caddy-*`, `database-*`). Each file is one entry; some categories have
multiple files.
- id: release-verify-race
  symptom: |
    `just release-run TICKET` fails in phase 4 (verify) with:
      ✗ wikijs.healthy: unhealthy
      ✗ http-health: HTTP 0
    But `just status wikijs` 30 seconds later shows healthy.
  detection:
    - command: just release-state TICKET
      expect: |
        Output shows phase=verify with rc != 0.
    - command: just status SERVICE
      expect: |
        Container healthy 30+ seconds after failure.
  root_cause: |
    Phase 4 verify probes the service immediately after force-recreate
    in phase 2. For slow-starting services (Wiki.js, Authentik, ERPNext)
    the container needs 30-60s before the health endpoint stabilises.
    The original verify had no retry, so a single probe could fire
    during the unstable window.
  fix:
    bootstrap_script: scripts/release/verify.ts
    bootstrap_change: |
      Wrap the http-health probe in a retry helper:
        retryWithBackoff(probe, { maxAttempts: 6, baseDelayMs: 5000 })
      First retry at 5s, then 10s, 20s, 40s, 80s, 160s.
    adhoc: |
      For the current stuck release, re-run only the verify phase:
        just release-run TICKET --from=verify
  verify:
    - just release-state TICKET
    - just status SERVICE
  services: [wikijs, authentik, erpnext]
  tags: [release, race-condition, verify-phase]
  fixed_in: '#3918'
  related: [release-orchestrator-source-drift]
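The `bootstrap_change` above names a `retryWithBackoff` helper without showing it. A minimal sketch of what such a helper could look like follows; it is not the code in scripts/release/verify.ts. The `RetryOptions` shape, the injectable `sleep` parameter, and the reading of `maxAttempts` as "total attempts including the first" are all assumptions made for illustration.

```typescript
// Hypothetical sketch; the real helper in scripts/release/verify.ts may differ.
interface RetryOptions {
  maxAttempts: number; // assumed: total attempts, including the first one
  baseDelayMs: number; // wait before the first retry; doubles after each failure
  sleep?: (ms: number) => Promise<void>; // assumed extra: injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  probe: () => Promise<T>,
  { maxAttempts, baseDelayMs, sleep = defaultSleep }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Success on any attempt ends the loop immediately.
      return await probe();
    } catch (err) {
      lastError = err;
      // Wait 1x, 2x, 4x, ... the base delay, but not after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  // Every attempt failed: surface the last probe error to the caller.
  throw lastError;
}
```

Under that reading, `{ maxAttempts: 6, baseDelayMs: 5000 }` produces five waits (5s, 10s, 20s, 40s, 80s). If `maxAttempts` instead counts retries, a sixth wait of 160s follows, which matches the schedule quoted in the playbook.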
| Field | Purpose |
|---|---|
| `id` | Stable identifier (kebab-case). Referenced from commits + cross-playbook links. |
| `symptom` | Free-form description matching what an operator (or agent) actually sees. Include the verbatim error message they'll grep for. |
| `detection` | Commands to confirm "yes, this playbook applies". `expect:` block says what the output should look like. |
| `root_cause` | The actual mechanism. Don't shortcut to the fix; explain WHY. Future-you will thank you. |
| `fix.bootstrap_script` | File path of the script where the permanent fix lives. (Required per Golden Rule 8.) |
| `fix.bootstrap_change` | One-paragraph description of the fix. |
| `fix.adhoc` | Commands to fix the running environment now, after the bootstrap fix is in place. |
| `verify` | Commands that confirm the fix worked. |
| `services` | Which services are affected. |
| `tags` | Free-form for filtering / searching. |
| `fixed_in` | Ticket number where this was first fixed. |
| `related` | Other playbook IDs that are commonly co-occurring. |
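The field table can also be read as a type. The sketch below mirrors it as a TypeScript interface and adds a small, hypothetical `lintPlaybook` check. The source does not say which fields are mandatory beyond `fix.bootstrap_script`, nor what `just contract-validate` actually enforces, so the checks here (and treating `fix.adhoc` as optional) are assumptions.

```typescript
// One detection step: a command plus a description of the expected output.
interface DetectionStep {
  command: string;
  expect: string;
}

// Shape of a playbook entry, following the field table.
interface PlaybookEntry {
  id: string;
  symptom: string;
  detection: DetectionStep[];
  root_cause: string;
  fix: {
    bootstrap_script: string;
    bootstrap_change: string;
    adhoc?: string; // assumed optional; the source does not say
  };
  verify: string[];
  services: string[];
  tags: string[];
  fixed_in: string;
  related: string[];
}

const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// Hypothetical lint: returns a list of problems, empty when nothing is flagged.
function lintPlaybook(entry: Partial<PlaybookEntry>): string[] {
  const problems: string[] = [];
  if (!entry.id || !KEBAB_CASE.test(entry.id)) {
    problems.push("id must be a kebab-case identifier");
  }
  if (!entry.symptom || entry.symptom.trim() === "") {
    problems.push("symptom is empty");
  }
  if (!entry.root_cause || entry.root_cause.trim() === "") {
    problems.push("root_cause is empty");
  }
  if (!entry.fix || !entry.fix.bootstrap_script) {
    problems.push("fix.bootstrap_script is missing");
  }
  if (!entry.verify || entry.verify.length === 0) {
    problems.push("verify has no commands");
  }
  return problems;
}
```

An entry that only says "Run X. It works." would come back with `root_cause is empty` and `verify has no commands`, which are the same gaps a human reviewer would flag.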
The grep step is what makes this work for AI agents too: Claude (or a
junior operator) reads the failure log, greps for known symptoms, finds the
matching playbook, and executes the fix. Without playbooks, that knowledge
lives only in the head of whoever debugged the issue first.
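The lookup described here (read the log, search for known symptoms, pick the playbook) can be sketched in a few lines. The function below is a hypothetical illustration, not part of the repository: it works on already-parsed `id` / `symptom` pairs, and the rule "a symptom line of at least 8 characters appears verbatim in the log" is an assumed heuristic standing in for a manual grep.

```typescript
interface PlaybookSymptom {
  id: string;
  symptom: string;
}

interface PlaybookMatch {
  id: string;
  hits: number; // number of symptom lines found verbatim in the log
}

// Hypothetical matcher: ranks playbooks by how many symptom lines occur in the log.
function findMatchingPlaybooks(
  log: string,
  playbooks: PlaybookSymptom[],
): PlaybookMatch[] {
  const matches: PlaybookMatch[] = [];
  for (const { id, symptom } of playbooks) {
    const candidateLines = symptom
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length >= 8); // skip blanks and trivially short lines
    const hits = candidateLines.filter((line) => log.indexOf(line) !== -1).length;
    if (hits > 0) {
      matches.push({ id, hits });
    }
  }
  // Best match first.
  return matches.sort((a, b) => b.hits - a.hits);
}
```

This is why the `symptom` field should carry the verbatim error text: a paraphrased symptom gives a matcher like this (or a plain `grep`) nothing to hit.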
When you diagnose a new failure mode:
The "playbook first" ordering matters. After you've fixed the issue, you
forget what was confusing about it. The window to capture the confusion
is right after diagnosis.
Good symptom field: verbatim error message + the operator's mental
context. "I got error X but I thought I was doing Y" framing.
Good root_cause field: explains the mechanism, not just the rule. "We
got burned by this in Q3 because of Y" > "always do X".
Good fix field: names the specific file + function where the permanent
fix lives. Future-you can navigate.
Good verify field: commands that confirm the fix, not "check the
service works".
Bad playbook entry: "Run X. It works." This is a recipe, not a
playbook. The next operator will hit the same issue + add a duplicate
recipe.
# Find playbooks by category
ls recovery/playbooks/ | grep -i sso-
# Search by tag
grep -lE "tags:.*race-condition" recovery/playbooks/*.yaml
# Search by error message you're seeing
grep -l "BASE_SHA == MERGE_SHA" recovery/playbooks/*.yaml
# Validate playbook YAML
just contract-validate recovery/playbooks
- /pma/cookbook/debug-a-failed-release: the operator's decision tree that uses playbooks.
- /pma/internals/release-orchestrator: where the failure logs live (.asd/workspace/releases/).
- /asd/cookbook/debug-a-broken-route: asd-side equivalent decision tree.
- CLAUDE.md: the rule that makes playbooks non-optional.