Backup, Restore, and Disaster Recovery
Owner: Ops / Backend Lead | Last Edited: March 26, 2026 | Last Reviewed: March 26, 2026
1. Availability Objectives
| Objective | Target |
|---|---|
| Recovery Point Objective (RPO) | 24 hours (daily backup minimum) |
| Recovery Time Objective (RTO) | 4 hours (restore from backup to serving traffic) |
| Uptime target | 99.5% monthly |
A 99.5% monthly uptime target permits roughly 3.6 hours of downtime per month (0.5% of ~720 hours). These are initial targets; tighten them as the business requires.
2. What Needs Backing Up
| Data | System | Backup owner | Notes |
|---|---|---|---|
| Primary database | PostgreSQL (Supabase/RDS) | Managed service + Ops | All org-scoped data: buildings, comps, TIMs, reports, memberships, audit logs, exports |
| Import staging files | Disk (IMPORT_UPLOAD_DIR) | Ops | CSV uploads staged before processing. Low persistence requirement — these are transient. |
| Redis cache | Redis (Upstash/ElastiCache) | Managed service | Cache only — no durable state. Redis loss is recoverable by cache warming, not a data loss event. |
Redis is not a backup concern — it is a cache. If Redis is lost entirely, the API continues to work (cache misses fall through to the DB), with degraded performance.
Import staging files are transient — they are used to process an import job and are not the source of truth for the imported data. The imported records in Postgres are what matters.
3. Backup Schedule
PostgreSQL
Use the backup mechanism provided by your managed DB service:
Supabase:
- Daily automatic backups (PITR available on Pro plan).
- Retention: 7 days (default) — extend to 30 days if feasible.
- Manual snapshot: Supabase dashboard → Database → Backups → Take backup.
AWS RDS:
- Enable automated backups with a retention period of at least 7 days.
- Enable PITR (Point-In-Time Recovery).
- Create a manual snapshot before every production migration.
Minimum acceptable baseline:
- 1 daily automated backup, retained for 7 days.
- 1 manual snapshot taken immediately before any production migration.
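The manual-snapshot baseline above can be scripted for RDS. This is a minimal sketch, not the prescribed procedure: `prod-db` is a placeholder instance identifier, and the AWS calls only fire when `RUN_AWS=1` is set, so the script is safe to dry-run.

```shell
#!/usr/bin/env sh
# Sketch: take a manual RDS snapshot before a production migration and block
# until it is usable. "prod-db" is a hypothetical instance identifier; adjust
# to your environment. Requires a configured AWS CLI when RUN_AWS=1.
set -eu

# Sortable, self-describing snapshot ID, e.g. pre-migration-20260326T120000Z.
snapshot_name() {
  printf 'pre-migration-%s' "$(date -u +%Y%m%dT%H%M%SZ)"
}

SNAPSHOT_ID="$(snapshot_name)"
echo "Snapshot ID: $SNAPSHOT_ID"

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws rds create-db-snapshot \
    --db-instance-identifier prod-db \
    --db-snapshot-identifier "$SNAPSHOT_ID"
  # Do not start the migration until this wait returns.
  aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$SNAPSHOT_ID"
fi
```

The timestamped ID makes snapshots sortable and self-documenting, which simplifies the "record the snapshot ID/timestamp" step in Section 5.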
4. Restore Procedure
4.1 PostgreSQL — restore from managed backup
Supabase:
- Go to Supabase dashboard → Database → Backups.
- Select the backup point to restore to.
- Click Restore. This creates a new project or restores in-place depending on your plan.
- Update `DATABASE_URL` in the production environment to point at the restored DB if the endpoint changed.
- Restart the API container.
- Verify `/ready` returns 200.
- Smoke test key API endpoints.
AWS RDS:
- Go to RDS console → Snapshots (or Automated Backups for PITR).
- Select the restore point.
- Restore to a new DB instance (recommended — keeps the old instance intact for forensics).
- Update `DATABASE_URL` in the production environment to the new instance endpoint.
- Update DNS/connection strings as needed.
- Restart the API container.
- Verify `/ready` returns 200.
- Smoke test.
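The restore-to-new-instance step can be sketched with the AWS CLI. The identifiers below are placeholders, and the AWS calls only fire when `RUN_AWS=1` is set:

```shell
#!/usr/bin/env sh
# Sketch: restore an RDS snapshot into a NEW instance, keeping the old one
# intact for forensics. Identifiers are placeholders; requires a configured
# AWS CLI when RUN_AWS=1.
set -eu

SNAPSHOT_ID="${SNAPSHOT_ID:-pre-migration-20260326T120000Z}"  # restore point
NEW_INSTANCE="restored-$(date -u +%Y%m%d)"                    # fresh instance name

if [ "${RUN_AWS:-0}" = "1" ]; then
  aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$NEW_INSTANCE" \
    --db-snapshot-identifier "$SNAPSHOT_ID"
  aws rds wait db-instance-available \
    --db-instance-identifier "$NEW_INSTANCE"
  # Print the new endpoint to plug into DATABASE_URL:
  aws rds describe-db-instances \
    --db-instance-identifier "$NEW_INSTANCE" \
    --query 'DBInstances[0].Endpoint.Address' --output text
fi
echo "Target instance: $NEW_INSTANCE"
```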
4.2 Full disaster recovery (instance + DB loss)
- Provision a new VM/container host.
- Deploy the API container using `go-backend/ops/deploy/docker-compose.yml`.
- Restore the DB from backup (see 4.1).
- Restore all production secrets in the new environment's secret store.
- Reconnect the Cloudflare tunnel or LB to the new host.
- Verify `/health` and `/ready`.
- Run the k6 smoke gate.
Estimated RTO for a full-loss scenario: 2–4 hours, assuming a managed DB restore is available.
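The verification step at the end of the DR procedure can be scripted. This sketch assumes `BASE_URL` points at the restored deployment (e.g. `https://api.example.com`); the probes only fire when that variable is set:

```shell
#!/usr/bin/env sh
# Sketch: post-restore smoke check. Probes /health and /ready and fails
# unless both return 200. BASE_URL is an assumption, not a documented var.
set -eu

# Pass/fail decision kept in a pure function so it is easy to test.
both_ok() {
  [ "$1" = "200" ] && [ "$2" = "200" ]
}

if [ -n "${BASE_URL:-}" ]; then
  # "|| true" keeps set -e from aborting on connection failures; curl's
  # -w prints "000" when no HTTP status was received.
  health="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/health" || true)"
  ready="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/ready" || true)"
  if both_ok "$health" "$ready"; then
    echo "smoke check passed"
  else
    echo "smoke check FAILED: /health=$health /ready=$ready" >&2
    exit 1
  fi
fi
```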
5. Pre-Migration Backup Checklist
Before every production database migration:
- Take a manual DB snapshot (see Section 3).
- Record the snapshot ID/timestamp.
- Confirm the snapshot completed successfully before applying the migration.
- Keep the snapshot for at least 30 days post-migration.
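Recording the snapshot ID and timestamp can be as simple as appending a line to a shared log file. The log path, field layout, and migration name below are illustrative, not prescribed by this runbook:

```shell
#!/usr/bin/env sh
# Sketch: append a pre-migration snapshot record so the "record the
# snapshot ID/timestamp" step is never skipped. Log path and fields are
# illustrative assumptions.
set -eu

LOG="${SNAPSHOT_LOG:-$(mktemp)}"

record_snapshot() {
  # record_snapshot <snapshot-id> <migration-name>
  # Tab-separated: UTC timestamp, snapshot ID, migration it precedes.
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> "$LOG"
}

# Hypothetical snapshot ID and migration name:
record_snapshot "pre-migration-20260326T120000Z" "20260326_add_comps_index"
tail -n 1 "$LOG"
```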
6. Restore Drill
A restore drill must be performed at least quarterly to verify the restore process actually works.
Drill procedure:
- Identify a non-production environment to restore into (staging or a fresh test instance).
- Restore from the most recent automated backup.
- Verify the restored DB contains expected data (spot-check a few records from each major table).
- Verify the API can connect to the restored DB (point a staging API at it temporarily).
- Record the drill result (date, backup point used, outcome, any issues).
Store drill records in docs/audit/restore-drills/.
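The spot-check step of the drill can be sketched with `psql`. The table names mirror the data listed in Section 2 but are assumptions about the schema, and the queries only fire when `RESTORED_DATABASE_URL` is set:

```shell
#!/usr/bin/env sh
# Sketch: spot-check row counts in a restored DB during a drill.
# RESTORED_DATABASE_URL and the table names are assumptions; adjust to
# the real schema. Safe to dry-run when the variable is unset.
set -eu

# Major org-scoped tables from Section 2 (names assumed).
spot_check_tables() {
  printf '%s\n' buildings comps tims reports memberships audit_logs
}

if [ -n "${RESTORED_DATABASE_URL:-}" ]; then
  spot_check_tables | while read -r t; do
    # -A: unaligned, -t: tuples only, -c: run a single command.
    count="$(psql "$RESTORED_DATABASE_URL" -Atc "SELECT count(*) FROM \"$t\";")"
    echo "$t: $count rows"
  done
fi
```

A zero count in any major table is a drill failure worth recording in the notes column of the log below.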
Restore drill log
| Date | Backup point used | Environment | Result | Notes | Completed by |
|---|---|---|---|---|---|
| — | — | — | — | — | — |
7. Degraded Mode Behavior
| Dependency down | API behavior |
|---|---|
| Redis | Cache misses fall through to DB. All endpoints continue to work. Performance degrades. |
| DB | API returns 503 for all data endpoints. /health still returns 200; /ready returns 503. |
| Clerk (JWKS endpoint) | Auth failures if JWKS cache expires. The backend caches JWKS and refreshes in background — short outages are tolerated. Extended Clerk outage blocks all user auth. |
| Cloudflare / edge | API is unreachable from the public internet. The API process itself continues running on the VM loopback. |
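The `/health` vs `/ready` split in the table above can drive a simple monitoring probe. This sketch assumes a `BASE_URL` variable; the classification logic is a pure function so it can be exercised without a live deployment:

```shell
#!/usr/bin/env sh
# Sketch: classify service state from the /health vs /ready behavior
# described above. BASE_URL is an assumption; probes only fire when set.
set -eu

classify() {
  # classify <health_code> <ready_code>
  if [ "$1" = "200" ] && [ "$2" = "200" ]; then
    echo "healthy"
  elif [ "$1" = "200" ]; then
    echo "degraded: process up, dependency (likely DB) down"
  else
    echo "down: process unreachable"
  fi
}

if [ -n "${BASE_URL:-}" ]; then
  # "|| true" keeps set -e from aborting when the host is unreachable;
  # curl's -w prints "000" when no HTTP status was received.
  h="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/health" || true)"
  r="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE_URL/ready" || true)"
  echo "health=$h ready=$r -> $(classify "$h" "$r")"
fi
```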
8. Related Docs
- `docs/ops/production-runbook.md` — deploy, rollback, migration steps
- `docs/security/incident-response.md` — what to do during an outage
- `docs/security/secrets-management.md` — secret rotation (needed after a restore)