Backup, Restore, and Disaster Recovery
Owner: Ops / Backend Lead | Last Edited: March 26, 2026 | Last Reviewed: March 26, 2026
1. Availability Objectives
| Objective | Target |
|---|---|
| Recovery Point Objective (RPO) | 24 hours (daily backup minimum) |
| Recovery Time Objective (RTO) | 4 hours (restore from backup to serving traffic) |
| Uptime target | 99.5% monthly |
A 99.5% monthly uptime target permits roughly 3.6 hours of downtime per month (0.5% of ~720 hours). These are initial targets; tighten them as the business requires.
2. What Needs Backing Up
| Data | System | Backup owner | Notes |
|---|---|---|---|
| Primary database | PostgreSQL (Supabase/RDS) | Managed service + Ops | All org-scoped data: buildings, comps, TIMs, reports, memberships, audit logs, exports |
| Import staging files | Disk (IMPORT_UPLOAD_DIR) | Ops | CSV uploads staged before processing. Low persistence requirement — these are transient. |
| Redis cache | Redis (Upstash/ElastiCache) | Managed service | Cache only — no durable state. Redis loss is recoverable by cache warming, not a data loss event. |
Redis is not a backup concern — it is a cache. If Redis is lost entirely, the API continues to work (cache misses fall through to the DB), with degraded performance.
Import staging files are transient — they are used to process an import job and are not the source of truth for the imported data. The imported records in Postgres are what matters.
3. Backup Schedule
PostgreSQL
Use the backup mechanism provided by your managed DB service:
Supabase:
- Daily automatic backups (PITR available on Pro plan).
- Retention: 7 days (default) — extend to 30 days if feasible.
- Manual snapshot: Supabase dashboard → Database → Backups → Take backup.
AWS RDS:
- Enable automated backups with a retention period of at least 7 days.
- Enable PITR (Point-In-Time Recovery).
- Create a manual snapshot before every production migration.
Minimum acceptable baseline:
- 1 daily automated backup, retained for 7 days.
- 1 manual snapshot taken immediately before any production migration.
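The manual-snapshot baseline above can be scripted for RDS. This is a minimal sketch, not the prescribed procedure: `prod-db` is a placeholder instance identifier, and the AWS calls only fire when `RUN_AWS=1` is set, so the script is safe to dry-run.

```shell
#!/usr/bin/env sh
# Sketch: take a manual RDS snapshot before a production migration and block
# until it is usable. "prod-db" is a hypothetical instance identifier; adjust
# to your environment. Requires a configured AWS CLI when RUN_AWS=1.
set -eu

# Sortable, self-describing snapshot ID, e.g. pre-migration-20260326T120000Z.
snapshot_name() {
  printf 'pre-migration-%s' "$(date -u +%Y%m%dT%H%M%SZ)"
}

SNAPSHOT_ID="$(snapshot_name)"
echo "Snapshot ID: $SNAPSHOT_ID"

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws rds create-db-snapshot \
    --db-instance-identifier prod-db \
    --db-snapshot-identifier "$SNAPSHOT_ID"
  # Do not start the migration until this wait returns.
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$SNAPSHOT_ID"
fi
```

The timestamped ID makes snapshots sortable and self-documenting, which simplifies the "record the snapshot ID/timestamp" step in Section 5.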
4. Restore Procedure
4.1 PostgreSQL — restore from managed backup
Supabase:
- Go to Supabase dashboard → Database → Backups.
- Select the backup point to restore to.
- Click Restore. This creates a new project or restores in-place depending on your plan.
- Update `DATABASE_URL` in the production environment to point at the restored DB if the endpoint changed.
- Restart the API container.
- Verify `/ready` returns 200.
- Smoke test key API endpoints.
AWS RDS:
- Go to RDS console → Snapshots (or Automated Backups for PITR).
- Select the restore point.
- Restore to a new DB instance (recommended — keeps the old instance intact for forensics).
- Update `DATABASE_URL` in the production environment to the new instance endpoint.
- Update DNS/connection strings as needed.
- Restart the API container.
- Verify `/ready` returns 200.
- Smoke test.
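The restore-to-new-instance step can be sketched with the AWS CLI. The identifiers below are placeholders, and the AWS calls only fire when `RUN_AWS=1` is set:

```shell
#!/usr/bin/env sh
# Sketch: restore an RDS snapshot into a NEW instance, keeping the old one
# intact for forensics. Identifiers are placeholders; requires a configured
# AWS CLI when RUN_AWS=1.
set -eu

SNAPSHOT_ID="${SNAPSHOT_ID:-pre-migration-20260326T120000Z}"  # restore point
NEW_INSTANCE="restored-$(date -u +%Y%m%d)"                    # fresh instance name

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$NEW_INSTANCE" \
    --db-snapshot-identifier "$SNAPSHOT_ID"
  aws rds wait db-instance-available \
    --db-instance-identifier "$NEW_INSTANCE"
  # Print the new endpoint to plug into DATABASE_URL:
  aws rds describe-db-instances \
    --db-instance-identifier "$NEW_INSTANCE" \
    --query 'DBInstances[0].Endpoint.Address' --output text
fi
echo "Target instance: $NEW_INSTANCE"
```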
4.2 Full disaster recovery (instance + DB loss)
- Provision a new VM/container host.
- Deploy the API container using `go-backend/ops/deploy/docker-compose.yml`.
- Restore the DB from backup (see 4.1).
- Restore all production secrets in the new environment's secret store.
- Reconnect the Cloudflare tunnel or LB to the new host.
- Verify `/health` and `/ready`.
- Run the k6 smoke gate.
Estimated RTO for a full-loss scenario: 2–4 hours, assuming a managed DB restore is available.
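The verification step at the end of the DR procedure can be scripted. This sketch assumes `BASE_URL` points at the restored deployment (e.g. `https://api.example.com`); the probes only fire when that variable is set:

```shell
#!/usr/bin/env sh
# Sketch: post-restore smoke check. Probes /health and /ready and fails
# unless both return 200. BASE_URL is an assumption, not a documented var.
set -eu

# Pass/fail decision kept in a pure function so it is easy to test.
both_ok() {
  [ "$1" = "200" ] && [ "$2" = "200" ]
}

if [ -n "${BASE_URL:-}" ]; then
  # "|| true" keeps set -e from aborting on connection failures; curl's
  # -w prints "000" when no HTTP status was received.
  health="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/health" || true)"
  ready="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/ready" || true)"
  if both_ok "$health" "$ready"; then
    echo "smoke check passed"
  else
    echo "smoke check FAILED: /health=$health /ready=$ready" >&2
    exit 1
  fi
fi
```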
5. Pre-Migration Backup Checklist
Before every production database migration:
- Take a manual DB snapshot (see Section 3).
- Record the snapshot ID/timestamp.
- Confirm the snapshot completed successfully before applying the migration.
- Keep the snapshot for at least 30 days post-migration.
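Recording the snapshot ID and timestamp can be as simple as appending a line to a shared log file. The log path, field layout, and migration name below are illustrative, not prescribed by this runbook:

```shell
#!/usr/bin/env sh
# Sketch: append a pre-migration snapshot record so the "record the
# snapshot ID/timestamp" step is never skipped. Log path and fields are
# illustrative assumptions.
set -eu

LOG="${SNAPSHOT_LOG:-$(mktemp)}"

record_snapshot() {
  # record_snapshot <snapshot-id> <migration-name>
  # Tab-separated: UTC timestamp, snapshot ID, migration it precedes.
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOG"
}

# Hypothetical snapshot ID and migration name:
record_snapshot "pre-migration-20260326T120000Z" "20260326_add_comps_index"
tail -n 1 "$LOG"
```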
6. Restore Drill
A restore drill must be performed at least quarterly to verify the restore process actually works.
Drill procedure:
- Identify a non-production environment to restore into (staging or a fresh test instance).
- Restore from the most recent automated backup.
- Verify the restored DB contains expected data (spot-check a few records from each major table).
- Verify the API can connect to the restored DB (point a staging API at it temporarily).
- Record the drill result (date, backup point used, outcome, any issues).
Store drill records in docs/audit/restore-drills/.
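The spot-check step of the drill can be sketched with `psql`. The table names mirror the data listed in Section 2 but are assumptions about the schema, and the queries only fire when `RESTORED_DATABASE_URL` is set:

```shell
#!/usr/bin/env sh
# Sketch: spot-check row counts in a restored DB during a drill.
# RESTORED_DATABASE_URL and the table names are assumptions; adjust to
# the real schema. Safe to dry-run when the variable is unset.
set -eu

# Major org-scoped tables from Section 2 (names assumed).
spot_check_tables() {
  printf '%s\n' buildings comps tims reports memberships audit_logs
}

if [ -n "${RESTORED_DATABASE_URL:-}" ]; then
  spot_check_tables | while read -r t; do
    # -A: unaligned, -t: tuples only, -c: run a single command.
    count="$(psql "$RESTORED_DATABASE_URL" -Atc "SELECT count(*) FROM \"$t\";")"
    echo "$t: $count rows"
  done
fi
```

A zero count in any major table is a drill failure worth recording in the notes column of the log below.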
Restore drill log
| Date | Backup point used | Environment | Result | Notes | Completed by |
|---|---|---|---|---|---|
| — | — | — | — | — | — |
7. Degraded Mode Behavior
| Dependency down | API behavior |
|---|---|
| Redis | Cache misses fall through to DB. All endpoints continue to work. Performance degrades. |
| DB | API returns 503 for all data endpoints. /health still returns 200; /ready returns 503. |
| Clerk (JWKS endpoint) | Auth failures if JWKS cache expires. The backend caches JWKS and refreshes in background — short outages are tolerated. Extended Clerk outage blocks all user auth. |
| Cloudflare / edge | API is unreachable from the public internet. The API process itself continues running on the VM loopback. |
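The `/health` vs `/ready` split in the table above can drive a simple monitoring probe. This sketch assumes a `BASE_URL` variable; the classification logic is a pure function so it can be exercised without a live deployment:

```shell
#!/usr/bin/env sh
# Sketch: classify service state from the /health vs /ready behavior
# described above. BASE_URL is an assumption; probes only fire when set.
set -eu

classify() {
  # classify <health_code> <ready_code>
  if [ "$1" = "200" ] && [ "$2" = "200" ]; then
    echo "healthy"
  elif [ "$1" = "200" ]; then
    echo "degraded: process up, dependency (likely DB) down"
  else
    echo "down: process unreachable"
  fi
}

if [ -n "${BASE_URL:-}" ]; then
  # "|| true" keeps set -e from aborting when the host is unreachable;
  # curl's -w prints "000" when no HTTP status was received.
  h="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/health" || true)"
  r="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/ready" || true)"
  echo "health=$h ready=$r -> $(classify "$h" "$r")"
fi
```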
8. Related Docs
- `docs/ops/production-runbook.md` — deploy, rollback, migration steps
- `docs/security/incident-response.md` — what to do during an outage
- `docs/security/secrets-management.md` — secret rotation (needed after a restore)