Backup is a manifest.yaml checkbox, not a script someone has to remember to write. Restore is symmetric: read the same manifest, walk the fixes.post_restore.functions, and services come back. This rung is about building the muscle to lose data without losing sleep.
Level: 5 · Reading time: 20 min
Already know this rung? Skip to Level 6 — release and rollback (planned) where you'll use this same backup machinery to make production deploys atomic.
PMA's backup model: each service declares its own backup strategy in the manifest. The standard just backup <svc> flow reads those fields and snapshots the right thing — a database dump for postgres-backed services, a tar of the data volume for file-backed services, an .env slice for stateless services.
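To make the "manifest decides what gets snapshotted" idea concrete, here is a minimal sketch of a dispatcher keyed on the manifest's backup.type field. It is not PMA's actual code: the function name plan_backup, the dict layout, and every artifact filename except dump.sql (which appears in the backup path used in this rung) are assumptions for illustration.

```python
from pathlib import Path

# Backup root used throughout this rung.
BACKUP_ROOT = Path(".asd/workspace/backups")

# Hypothetical artifact names per backup.type. Only dump.sql appears in
# this guide; the other three filenames are invented for the sketch.
ARTIFACTS = {
    "database": "dump.sql",
    "volume": "volume.tar",
    "workspace": "workspace.tar",
    "config": "env.slice",
}


def plan_backup(service: str, manifest: dict, timestamp: str) -> dict:
    """Return where a snapshot would be written, based only on the manifest."""
    backup = manifest.get("backup", {})
    if not backup.get("enabled", False):
        raise ValueError(f"{service}: backup.enabled is not true")
    kind = backup.get("type")
    if kind not in ARTIFACTS:
        raise ValueError(f"{service}: unknown backup.type {kind!r}")
    target = BACKUP_ROOT / service / timestamp / ARTIFACTS[kind]
    return {"service": service, "type": kind, "target": target}
```

The point of the design is that adding a service means adding manifest fields, not another branch in a backup script.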
Restore is mechanical: read the same manifest, untar / restore SQL, then walk any service-specific quirks declared under fixes.post_restore.functions. The wikijs UID-drift fix shipped earlier in #3920 is a real example — a post-restore function that chowns the data volume to the right user.
In this rung we'll: (1) take a backup of every running service, (2) deliberately break one service, (3) restore it, (4) understand the per-service quirks via the fixes.post_restore hooks.
Per service:
just backup redmine
# Reads packages/redmine/manifest.yaml backup.* fields,
# snapshots the postgres database to
# .asd/workspace/backups/redmine/<timestamp>/dump.sql,
# rotates older backups per backup.retention policy.
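Rotation can be pictured as "keep the newest N timestamped directories, delete the rest". The sketch below assumes backup.retention boils down to a simple count (the real policy format is not shown in this rung) and that directory names follow the YYYYMMDD_HHMM pattern seen above, which sorts chronologically as plain text. The function name rotate_backups is hypothetical.

```python
import shutil
from pathlib import Path


def rotate_backups(service_dir: Path, keep: int) -> list[str]:
    """Delete all but the `keep` newest backup directories; return removed names."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Names like 20260518_1430 sort lexicographically in time order,
    # so a reverse name sort puts the newest snapshot first.
    snapshots = sorted(
        (p for p in service_dir.iterdir() if p.is_dir()),
        key=lambda p: p.name,
        reverse=True,
    )
    removed = []
    for old in snapshots[keep:]:
        shutil.rmtree(old)
        removed.append(old.name)
    return removed
```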
All services in the active profile:
just backup
# Iterates over every service whose manifest.yaml has
# backup.enabled: true. Sequential by default; --parallel
# for faster but heavier I/O.
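The "iterate over every enabled service, sequentially unless asked otherwise" behaviour might look roughly like this. It is a sketch, not the real implementation: run_backups, backup_enabled, and the backup_one callback are hypothetical names, and manifests are modelled as plain dicts that have already been parsed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def backup_enabled(manifests: dict[str, dict]) -> list[str]:
    """Names of services whose manifest sets backup.enabled: true."""
    return [
        name
        for name, manifest in manifests.items()
        if manifest.get("backup", {}).get("enabled") is True
    ]


def run_backups(
    manifests: dict[str, dict],
    backup_one: Callable[[str], str],
    parallel: bool = False,
) -> dict[str, str]:
    """Back up every enabled service; return {service: result}."""
    names = backup_enabled(manifests)
    if parallel:
        # Faster, but all snapshots compete for disk I/O at once.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(backup_one, names))
    else:
        results = [backup_one(name) for name in names]
    return dict(zip(names, results))
```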
List what's been backed up:
just list-backups
# Shows latest backup per service + age + size.
just list-backups redmine
# Shows full backup history for one service.
We'll use Redmine. You will lose any data created since the last
backup — practise this on a dev install first.
# 1. Take a fresh backup of the current state
just backup redmine
# 2. Note the timestamp from the output (e.g., 20260518_1430)
TIMESTAMP=20260518_1430 # use the real one
# 3. Break it deliberately — drop the Redmine database
just redmine-psql -c "DROP DATABASE redmine;"
just redmine-psql -c "CREATE DATABASE redmine;"
# 4. Verify it's broken
just status redmine
# redmine container: unhealthy
# Or: open the Redmine URL → 500 Internal Server Error
# 5. Restore
just restore redmine $TIMESTAMP
# Reads manifest, restores dump.sql,
# walks fixes.post_restore.functions (e.g. unlock migration table),
# restarts the container.
# 6. Verify
just status redmine
# redmine container: healthy
# Open Redmine URL → login → see your projects back
If anything went wrong during the restore, the most recent
backup is still on disk. Start over from step 5, and if the
snapshot itself looks bad, pick a different TIMESTAMP from
just list-backups redmine.
Some services need post-restore fixes that pure
"untar-and-restart" doesn't cover. The manifest declares them:
# packages/redmine/manifest.yaml (excerpt)
fixes:
  post_restore:
    script: scripts/post_restore.py
    functions:
      - name: fix_migration_lock
        description: Unlock stuck migration table after restore
        targets: database
Each function is a small Python function defined in
packages/<svc>/scripts/post_restore.py. The standard
restore_service.py walks the list and runs each one.
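As a rough picture of that walk, the sketch below takes a parsed manifest and a name-to-callable registry and runs each declared function in order. It is not the real restore_service.py: run_post_restore and the registry argument are hypothetical, and a real runner would more likely build the registry by importing the file named under script.

```python
from typing import Callable, Mapping


def run_post_restore(
    manifest: dict,
    registry: Mapping[str, Callable[[], None]],
) -> list[str]:
    """Run every function listed under fixes.post_restore.functions, in order."""
    spec = manifest.get("fixes", {}).get("post_restore", {})
    executed = []
    for entry in spec.get("functions", []):
        name = entry["name"]
        func = registry.get(name)
        if func is None:
            # Declared in the manifest but missing from the script:
            # fail loudly instead of silently skipping a fix.
            raise LookupError(f"post-restore function {name!r} is declared but not defined")
        func()
        executed.append(name)
    return executed
```

Failing on an undeclared or missing function keeps the manifest and the script from drifting apart unnoticed.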
Real examples from PMA:
| Service | Post-restore function | Why |
|---|---|---|
| Redmine | fix_migration_lock | Postgres restore can leave migration table in a "running" state; unlock it. |
| Wiki.js | chown_data_volume (shipped in #3920) | UID can drift between image variants; chown to current container user. |
| Mautic | regenerate_session_secret | Sessions from the dump are invalid; generate a fresh secret. |
| Postal | rerun_rabbitmq_setup | Restored RabbitMQ doesn't know about new vhost; re-declare. |
| Authentik | clean_oauth_clients | OAuth clients from source environment leak; drop them, let bootstrap re-create. |
Adding a new fix:
1. Write the function in packages/<svc>/scripts/post_restore.py.
2. Declare it in manifest.yaml under fixes.post_restore.functions.
3. Document it in recovery/playbooks/services/ so the next operator knows when it kicks in.
Per Golden Rule 11, every bootstrap fix needs a playbook entry.
This is the operational discipline that keeps recovery
machine-readable.
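For a feel of what one of these functions could contain, here is a sketch in the spirit of the Wiki.js chown_data_volume fix. The signature (a directory plus target uid/gid) and the return value are assumptions; the real function shipped in #3920 is not reproduced here.

```python
import os
from pathlib import Path


def chown_data_volume(data_dir: Path, uid: int, gid: int) -> int:
    """Recursively hand data_dir to uid:gid; return how many entries changed."""
    # Collect the root plus everything beneath it. os.walk does not
    # descend into symlinked directories by default.
    targets = [str(data_dir)]
    for dirpath, dirnames, filenames in os.walk(data_dir):
        targets.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changed = 0
    for path in targets:
        st = os.lstat(path)
        if st.st_uid != uid or st.st_gid != gid:
            # Change the link itself rather than whatever it points at.
            os.chown(path, uid, gid, follow_symlinks=False)
            changed += 1
    return changed
```

Only touching entries whose owner actually differs makes the fix safe to run twice: a second pass reports zero changes.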
After a successful restore drill:
$ just list-backups redmine
20260518_1430 Today 45 MB
20260518_0800 Today 45 MB
20260517_1430 Yesterday 44 MB
...
$ just status redmine
[asd-dev-redmine] ✓ healthy
[asd-dev-redmine-postgres] ✓ healthy
$ curl -sI ${REDMINE_URL} | head -1
HTTP/2 200
You've proven: (a) backups happen, (b) restoring from a backup
brings a service all the way back, (c) post-restore quirks are
declared in the manifest, not hidden in someone's head.
Three things to internalise:
1. Backup type matches service shape. backup.type: database →
dump the Postgres/MariaDB database. backup.type: volume → tar
the named volume. backup.type: workspace → tar the workspace
dir. backup.type: config → snapshot a .env slice. The framework
picks the right snapshot mechanism based on the manifest, so no
per-service backup script is needed.
2. Restore is the inverse of backup, with fixes.post_restore
for the messy parts. A backup is "snapshot the data, save
the timestamp". A restore is "load the snapshot back, then
run the manifest's declared post-restore functions". The
messy parts (UID drift, stale OAuth clients, migration table
locks) are documented as functions rather than tribal
knowledge.
3. Backups are local files in .asd/workspace/backups/.
No cloud storage is required (though you can rsync them off-host
if you want). They are discoverable with just list-backups and
rotated per service according to the backup.retention declaration.
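Because backups are plain directories on disk, a summary like the one just list-backups prints can be derived from the filesystem alone. The sketch below assumes the .asd/workspace/backups/<service>/<timestamp>/ layout used in this rung; latest_backups is a hypothetical helper, not the command's actual implementation.

```python
from pathlib import Path


def latest_backups(root: Path) -> dict[str, tuple[str, int]]:
    """Map each service to (newest timestamp, size in bytes of that snapshot)."""
    summary = {}
    for service_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Timestamps such as 20260518_1430 sort chronologically as text.
        stamps = sorted(p.name for p in service_dir.iterdir() if p.is_dir())
        if not stamps:
            continue  # service directory exists but holds no snapshots yet
        newest = service_dir / stamps[-1]
        size = sum(f.stat().st_size for f in newest.rglob("*") if f.is_file())
        summary[service_dir.name] = (stamps[-1], size)
    return summary
```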
For multi-service restore (e.g. after a full disk loss): the
order matters — Authentik first (every other service depends on
it for SSO), databases next, then services. just restore-all
does this in the right order per the startup_priority from
each manifest.
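The ordering itself is just a sort on a manifest field. The sketch below assumes startup_priority is a top-level integer where lower values start first, and that services without the field go last; both are guesses about the schema, and the priority numbers in the usage example are made up.

```python
# Hypothetical fallback for manifests that omit startup_priority.
DEFAULT_PRIORITY = 1000


def restore_order(manifests: dict[str, dict]) -> list[str]:
    """Service names sorted by startup_priority, ties broken by name."""
    return sorted(
        manifests,
        key=lambda name: (manifests[name].get("startup_priority", DEFAULT_PRIORITY), name),
    )


example = {
    "redmine": {"startup_priority": 30},
    "authentik": {"startup_priority": 10},
    "redmine-postgres": {"startup_priority": 20},
}
order = restore_order(example)  # authentik, then the database, then redmine
```

Breaking ties by name keeps the order deterministic, which matters when two operators compare restore logs.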
Reference: /pma/internals/architecture (planned) — the
backup pillar; recovery/playbooks/ — the documented quirks.
asd data push step to a cron job; asd data push is asd's snapshot upload mechanism (see /asd/internals/registry).
--mode=restore on just bootstrap-local. Wiki.js's manifest+hooks were hardened during the #3919/#3920 release for exactly this case.
/pma/internals/recovery-playbooks (planned): how to add a new playbook entry.