"just release-run TICKET failed mid-phase. What now?"
The orchestrator halts on failure between phases and keeps a state file plus an audit log, so you always know where it stopped. The same five-step decision tree applies to any failure.
Every release-run writes two files in .asd/workspace/releases/:
- <ticket>.env: state file (BASE_SHA, MERGE_SHA, SERVICES, BACKUPS)
- <ticket>.log: per-phase audit log (start / done / failed events)

These are what you read first when something goes wrong.
tail .asd/workspace/releases/<ticket>.log
# Look for the line containing event=failed
# e.g.: 2026-05-18T... ticket=1234 phase=prepare event=failed rc=1
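If the log is long, the failed phase can be pulled out directly. This is a minimal sketch that assumes every log line follows the key=value layout of the example line; the function name failed_phase is hypothetical and not part of the orchestrator.

```shell
# Print the phase of the most recent event=failed line in an audit log.
# Assumes the "phase=<name> event=failed" layout shown in the example line.
failed_phase() {
  log_file=$1
  grep 'event=failed' "$log_file" \
    | tail -n 1 \
    | sed -n 's/.*phase=\([^ ]*\).*/\1/p'
}

# Usage (hypothetical ticket number):
#   failed_phase .asd/workspace/releases/1234.log
```

The function prints nothing when the log contains no failed event, which is a useful signal that the run stopped for a different reason.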
Phase determines what to do next:
| Phase | What went wrong | Common causes |
|---|---|---|
| prepare | Git state or backup failed before any deploy mutation | Dirty working tree, BASE_SHA == MERGE_SHA (pre-pulled), backup script crashed |
| migrate | migrate.sh failed | A service didn't come up after force-recreate, health probe race, image pull failure |
| ticket | releases/<ticket>-*.sh failed | Per-ticket logic bug, stale env var, network issue |
| verify | Drift check / health probe failed | Service unhealthy after recreate, image tag mismatch |

The release-run output already gave you a hint. Re-read it:
release-run failed in phase <phase>. To roll back:
just release-revert <ticket>
Sometimes the recovery hint is more specific (e.g. for prepare-phase BASE_SHA collisions it tells you to git reset --hard <pre-pull-sha>).
recovery/playbooks/release-*.yaml contains structured entries for known failure modes. The current set:
| Playbook | When it applies |
|---|---|
| release-verify-race.yaml | Phase 4 fails with "unhealthy" + container becomes healthy 30s later (fixed in #3918: retry helper added) |
| release-run-exit-5-with-drift.yaml | Exit code 5 with nothing to do despite src/ changes (fixed in #3851: src/ classifier added) |
| release-orchestrator-source-drift.yaml | Phase 1 detects services as affected but phase 2 has no migrate.sh entries |

Match what you see against each playbook entry's symptom: field. If one matches, the playbook gives you the detection commands, the fix, and the verification steps.
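A quick way to shortlist candidate playbooks is to search them for the error text you saw. The sketch below is an illustration only: find_playbooks is a hypothetical helper, and it searches the whole file rather than just the symptom: field, so treat hits as candidates to read, not as confirmed matches.

```shell
# List playbook files whose text contains the given error string.
# -F treats the message as a literal string; -l prints only file names.
# The directory defaults to recovery/playbooks and can be overridden.
find_playbooks() {
  message=$1
  dir=${2:-recovery/playbooks}
  grep -l -F -- "$message" "$dir"/release-*.yaml 2>/dev/null
}

# Usage:
#   find_playbooks 'wikijs.healthy: unhealthy'
```

The function exits non-zero when nothing matches, which is the cue to continue with a manual diagnosis.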
This step cannot be skipped. Per Golden Rule 0: no ad-hoc fixes; fix the cause.
Common causes by phase:
[ERROR] Working tree is dirty. Commit, stash or discard before release.
→ Look at git status. Someone left work-in-progress on prod. Don't blindly stash — check what's there first. Often it's a legitimate WIP that needs committing.
[ERROR] already up-to-date with origin/main — nothing to release
→ Someone git pull-ed on prod before release-run. The orchestrator captures BASE_SHA pre-pull; pre-pulling collapses BASE_SHA == MERGE_SHA. Recovery: git reflog, git reset --hard <pre-pull-sha>, re-run release-run.
✗ wikijs.healthy: unhealthy
✗ http-health: HTTP 0
→ Service didn't come up fast enough after force-recreate. Check container status manually (just status wikijs). If now healthy: race condition (this is what #3918's retry helper fixes). If still unhealthy: check container logs (just logs wikijs) for the real error.
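To tell a startup race from a real failure, it can help to poll the health status for a short while instead of checking once. The following is a generic sketch and not the retry helper from #3918; retry_until and container_healthy are hypothetical names, and the container name in the usage line is an assumption.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical check: succeeds when Docker reports the container as healthy.
# Containers without a health check count as not healthy here.
container_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}

# Usage (container name is an assumption):
#   retry_until 6 5 container_healthy wikijs
```

If the check passes within a few attempts, the failure was most likely the race; if it never passes, the container logs are the next place to look.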
Error response from daemon: pull access denied for ...
→ Image pull failure. Check that the image tag in the manifest is correct and that the image exists at that tag in the registry.
The per-ticket script (releases/<ticket>-*.sh) is just bash. Read it and reproduce the failing command manually to find the real issue.
[WIKIJS] ERROR: could not read theming config: Forbidden
→ Stale API key (this is what #3919's post_start_init.ts fix prevents). Recovery: just wikijs-configure regenerates the key.
✗ <service>.image: got X:1, expected X:2
→ Image tag in the running container doesn't match the manifest. Usually means the force-recreate in phase 2 didn't actually pick up the new image. Check if a pull step was needed.
✗ http-health: HTTP 0
→ Service is up but not responding on the health endpoint. Inspect the service directly, then either bump the timeout / wait longer, or fix the actual health issue.
Once the root cause is fixed:
Resume from the failed phase:
just release-run <ticket> --from=<phase>
- --from=migrate if prepare succeeded but migrate failed.
- --from=ticket if migrate succeeded but the ticket script failed.
- --from=verify if everything succeeded but the verifier flaked.
The state file from phase 1 is used (BASE_SHA, MERGE_SHA, BACKUPS), so no re-prep needed.
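Before resuming, it may be worth confirming that the state file actually holds usable values. The sketch below assumes the state file stores plain KEY=VALUE lines, which its .env extension suggests but the text does not confirm; state_value and state_is_resumable are hypothetical helper names.

```shell
# Read one KEY=VALUE entry from a state file without sourcing it.
# Assumes plain KEY=VALUE lines; surrounding double quotes are stripped.
state_value() {
  key=$1
  file=$2
  sed -n "s/^${key}=//p" "$file" | tail -n 1 | tr -d '"'
}

# Succeed only if BASE_SHA and MERGE_SHA are both present and differ.
state_is_resumable() {
  file=$1
  base=$(state_value BASE_SHA "$file")
  merge=$(state_value MERGE_SHA "$file")
  [ -n "$base" ] && [ -n "$merge" ] && [ "$base" != "$merge" ]
}

# Usage (hypothetical ticket number):
#   state_is_resumable .asd/workspace/releases/1234.env && echo "ok to resume"
```

Parsing with sed instead of sourcing the file avoids executing anything the state file might contain.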
Or revert if unfixable:
just release-revert <ticket>
This restores the backups taken in phase 1 and reverts the merge commit. Use it when (a) the bug is in the released code itself, (b) the migration left state you can't roll forward through, or (c) you've burned an hour and need to ship a fix tomorrow.
If your failure mode wasn't in recovery/playbooks/, add it. Per Golden Rule 11, every bootstrap/release fix needs a playbook entry. The next operator (or AI agent) walks the same path you just walked, except with a documented recipe instead of guesswork.
Template:
- id: <kebab-case-id>
  symptom: |
    Free-form description of what the operator sees. Include
    the actual error message they'll grep for.
  detection:
    - command: <bash command to confirm this is the issue>
      expect: <what the output should be if this playbook applies>
  root_cause: |
    Why this happens. Be honest about which layer is at fault.
  fix:
    bootstrap_script: <file:fn that needs fixing>
    bootstrap_change: <description of the permanent fix>
    adhoc: |
      <commands to fix the running env now>
  verify:
    - <command to confirm the fix worked>
  services: [list]
  tags: [release, ...]
File it under recovery/playbooks/release-<topic>.yaml and include it in the same PR as the actual fix.
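A quick pre-PR check can catch a playbook entry that is missing one of the template's top-level keys. This sketch assumes all eight keys from the template are expected in every entry, which the text does not state outright; missing_playbook_keys is a hypothetical helper, and it only checks that the keys appear, not that the YAML is valid.

```shell
# Report template keys that do not appear in a playbook file.
# Prints one missing key per line; succeeds only when none are missing.
missing_playbook_keys() {
  file=$1
  missing=0
  for key in id symptom detection root_cause fix verify services tags; do
    # Match the key at the start of a line, after optional spaces or a list dash.
    if ! grep -q -E "^[ -]*${key}:" "$file"; then
      printf '%s\n' "$key"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage (hypothetical file name):
#   missing_playbook_keys recovery/playbooks/release-my-topic.yaml
```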
/pma/learn/06-release-and-rollback — the four phases in detail.