Incident Response

Severity levels, triage process, containment steps, and postmortem requirements for security and operational incidents.

Updated Apr 3, 2026

Owner: Backend Lead / On-call
Last Edited: March 26, 2026
Last Reviewed: March 26, 2026


1. Severity Levels

| Severity | Definition | Response target | Example |
|---|---|---|---|
| P1 — Critical | Production down or confirmed data breach | Immediate (< 15 min to triage) | API returning 500 for all requests; confirmed cross-org data leak; secret confirmed compromised in prod |
| P2 — High | Significant degradation or credible security threat | < 1 hour to triage | Elevated 5xx rate; repeated auth failures from a single source; Dependabot critical vuln in prod dep |
| P3 — Medium | Limited impact; workaround available | < 4 hours | Single endpoint misbehaving; Slack alerts not routing; staging environment issue |
| P4 — Low | Informational or minor | Next business day | Dependency alert (non-critical); doc drift; non-urgent config cleanup |

2. Incident Lifecycle

Detection → Triage → Containment → Eradication → Recovery → Postmortem

2.1 Detection

Incidents may be detected via:

  • Sentry alert (error spike, new error group, performance degradation)
  • Render service metrics anomaly (CPU, memory, restart)
  • External report (user, partner, security researcher)
  • GitHub secret scanning alert
  • Dependabot critical CVE alert
  • Manual observation

2.2 Triage (first 15 min for P1)

  1. Acknowledge the alert (Sentry notification or team Slack channel).
  2. Determine severity using the table above.
  3. Check /health and /ready — if either is down, the outage is infrastructure-level.
  4. Check Render logs for the backend service.
  5. Check Sentry for the time window of the anomaly (error groups, breadcrumbs, performance traces).
  6. Identify: is this a production outage, a security incident, or both?
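The health checks in step 3 can be scripted as a minimal sketch, assuming a Bash environment; the base URL below is a placeholder, not the real production hostname:

```shell
# Hedged sketch of step 3: probe the liveness and readiness endpoints.
# The base URL is a placeholder, not the real production hostname.
check_endpoint() {
  # prints OK when the path answers HTTP 200, DOWN otherwise
  local base="$1" path="$2" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base}${path}" 2>/dev/null)
  if [ "$code" = "200" ]; then echo "OK"; else echo "DOWN"; fi
}
# During triage:
#   check_endpoint "https://api.example.internal" /health
#   check_endpoint "https://api.example.internal" /ready
```

If both probes report DOWN, treat the outage as infrastructure-level per step 3; a single DOWN points at one misbehaving endpoint.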

2.3 Containment

Production outage:

  • If a bad deploy caused it: roll back immediately (see docs/ops/production-runbook.md Section 8).
  • If a dependency (DB, Redis) is down: notify the managed service provider and monitor.

Security incident (compromised secret, data exposure, auth bypass):

  • Revoke the affected credential immediately. Rotate it in the upstream service and update the value in Infisical (which auto-syncs to Render/Cloudflare). See docs/security/secrets-management.md.
  • If a user account is compromised: deprovision in Clerk immediately.
  • If cross-org data was accessed: identify affected org IDs from audit logs.
  • Do not delete logs — preserve evidence before any cleanup.

Active attack (repeated auth failures, rate limit abuse):

  • Block the source IP at the Cloudflare/edge level.
  • Do not block at the application level only — the edge block is faster and more reliable.
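One way the edge block can look, sketched against Cloudflare's IP Access Rules API. CF_ZONE_ID and CF_API_TOKEN are assumed environment variables, and the endpoint should be verified against the current Cloudflare API reference before relying on it:

```shell
# Hedged sketch of an edge-level block via Cloudflare's IP Access Rules API.
# CF_ZONE_ID and CF_API_TOKEN are assumptions; verify the endpoint in the
# Cloudflare API reference before relying on it.
block_ip_payload() {
  # builds the JSON body for a "block" rule on a single IP
  local ip="$1"
  printf '{"mode":"block","configuration":{"target":"ip","value":"%s"},"notes":"incident response"}' "$ip"
}
block_ip() {
  local ip="$1"
  curl -s -X POST \
    "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/firewall/access_rules/rules" \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$(block_ip_payload "$ip")"
}
```

Keeping the payload builder separate makes the block easy to dry-run and review before the rule goes live.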

2.4 Eradication

  • Patch or revert the root cause (bad deploy, vulnerable dep, misconfiguration).
  • Rotate all secrets that may have been exposed.
  • Remove malicious access (revoke API keys, deprovision compromised accounts).

2.5 Recovery

  • Re-deploy the clean/patched version.
  • Verify with /health, /ready, and k6 smoke gate.
  • Confirm monitoring and alerts are back to baseline.
  • Notify affected parties if data was exposed.
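The verification step can be scripted as a simple poll, under the same placeholder-hostname assumption; the k6 smoke gate still runs separately once this reports healthy:

```shell
# Hedged sketch: poll /health and /ready until both return 200, or give up.
# The base URL is a placeholder. Run the k6 smoke gate after "healthy".
wait_for_healthy() {
  local base="$1" attempts="${2:-10}" i code_h code_r
  for ((i = 1; i <= attempts; i++)); do
    code_h=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base}/health" 2>/dev/null)
    code_r=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${base}/ready" 2>/dev/null)
    if [ "$code_h" = "200" ] && [ "$code_r" = "200" ]; then
      echo "healthy"
      return 0
    fi
    sleep 2
  done
  echo "unhealthy"
  return 1
}
```

A bounded retry loop avoids declaring recovery on the first probe, which can race a service that is still warming up.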

2.6 Postmortem

Every P1 and P2 incident requires a postmortem within 5 business days.

Postmortem template:

## Incident: <title>
Date: YYYY-MM-DD
Severity: P1 / P2
Duration: <start> to <end>
Author: <name>

### What happened
<timeline of events>

### Root cause
<what actually caused it>

### Detection
<how was it found, and how long did detection take>

### Impact
<what was affected, how many users/orgs, any data exposure>

### Containment and recovery steps taken
<what was done and when>

### What went well
<things that worked>

### What to improve
<gaps in detection, response, or process>

### Action items
| Item | Owner | Due |
|---|---|---|

Store postmortems in docs/audit/incidents/.
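A trivial helper for consistent filenames in that directory; the date-plus-slug naming convention is an assumption, not something this doc mandates:

```shell
# Hedged helper: build a postmortem path under docs/audit/incidents/.
# The "YYYY-MM-DD-slug.md" naming convention is an assumption.
postmortem_path() {
  local day="$1" slug="$2"
  printf 'docs/audit/incidents/%s-%s.md' "$day" "$slug"
}
```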


3. Specific Playbooks

3.1 Compromised secret

  1. Identify which secret is compromised and in which environment.
  2. Rotate immediately using the playbook in docs/security/secrets-management.md.
  3. Audit access logs for the time window the secret was exposed.
  4. Identify any actions taken using the compromised credential.
  5. If customer data was potentially accessed: escalate to P1, notify affected orgs.
  6. Remove the secret from any git history or logs where it appeared.
  7. File postmortem.
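Step 6 can be done with git-filter-repo, assuming that tool is available (any history-rewriting tool works); note that rotation in step 2 must come first, since rewriting history does not revoke a leaked credential:

```shell
# Hedged sketch of step 6 using git-filter-repo (an assumption; any
# history-rewriting tool works). Rotate the secret FIRST -- scrubbing
# history does not un-leak or revoke anything.
printf 'THE_LEAKED_VALUE==>REDACTED\n' > /tmp/replacements.txt
# Rewrites all history; coordinate with the team and force-push after:
# git filter-repo --replace-text /tmp/replacements.txt
cat /tmp/replacements.txt
```

Everyone with a clone must re-clone or hard-reset after the rewrite, or the old value resurfaces on the next push.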

3.2 Confirmed cross-org data access

  1. Escalate to P1 immediately.
  2. Pull audit logs for the affected time window:
    • GET /api/v1/audit-logs (requires audit:read scope)
    • Filter by the attacker's org ID and the victim's org ID
  3. Identify exactly which records were accessed.
  4. Notify affected org(s).
  5. Determine root cause (code bug? compromised credential? config error?).
  6. Patch, deploy, verify.
  7. File postmortem.
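The audit-log pull in step 2 can be parameterised as below; the org_id/from/to query-parameter names are assumptions to be checked against the API reference, and API_TOKEN must carry the audit:read scope:

```shell
# Hedged sketch of the step-2 audit-log pull. The org_id/from/to query
# parameter names are assumptions; confirm them in the API reference.
audit_logs_url() {
  local base="$1" org="$2" from="$3" to="$4"
  printf '%s/api/v1/audit-logs?org_id=%s&from=%s&to=%s' "$base" "$org" "$from" "$to"
}
fetch_audit_logs() {
  # API_TOKEN must carry the audit:read scope
  curl -s -H "Authorization: Bearer ${API_TOKEN}" "$(audit_logs_url "$@")"
}
```

Pull the window twice, once per org ID, so the attacker's actions and the victim's exposed records can be cross-referenced.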

3.3 Production outage (API down)

  1. Check service status: the Render dashboard for the deployed service (or docker compose ps for a self-hosted stack).
  2. Check logs: Render service logs (or docker compose logs --tail=200 api if self-hosted).
  3. Check DB: can you reach /ready? If not, the DB or Redis may be down.
  4. If a deploy caused it: roll back (see docs/ops/production-runbook.md).
  5. If infrastructure: contact the managed service provider.
  6. Update status in Slack throughout.

3.4 Elevated 5xx rate (no full outage)

  1. Check logs for the error pattern (Render logs or Sentry breadcrumbs).
  2. Check Sentry — is it one endpoint or all?
  3. Check if a recent deploy correlates.
  4. If a deploy correlates: roll back.
  5. If no deploy: investigate the specific error in logs (DB timeout? validation panic? upstream dep?).
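When logs are available as raw lines, the 5xx share in step 1 is easy to compute; this sketch assumes one status code per line, so for real access logs select the status field first:

```shell
# Hedged helper: percentage of 5xx responses in a stream of status codes,
# one per line. For real access logs, extract the status-code field first.
five_xx_rate() {
  awk '{ total++; if ($1 ~ /^5/) err++ } END { if (total) printf "%.1f\n", 100 * err / total; else print "0.0" }'
}
# e.g. printf '200\n500\n200\n503\n' | five_xx_rate   -> 50.0
```

Comparing this number before and after the suspect deploy window makes the "does a deploy correlate" call in step 3 concrete.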

3.5 Dependabot critical CVE

  1. Review the CVE and assess whether the vulnerable code path is reachable in production.
  2. If reachable: treat as P2 and patch within 24 hours.
  3. If not reachable: treat as P3, patch within the next sprint.
  4. Track in the vulnerability management log (docs/security/vulnerability-management.md).
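The decision in steps 2 and 3 can be encoded directly. Reachability itself must still be assessed by hand, or with a tool such as govulncheck if the backend is Go (an assumption, since this doc does not state the language):

```shell
# Hedged helper encoding the triage rule above. Reachability analysis is
# manual, or via a tool such as govulncheck if the backend is Go (an
# assumption; this doc does not state the implementation language).
cve_severity() {
  case "$1" in
    reachable)   echo "P2: patch within 24 hours" ;;
    unreachable) echo "P3: patch within the next sprint" ;;
    *)           echo "unknown: treat as reachable until proven otherwise" ;;
  esac
}
```

Defaulting the unknown case to the stricter treatment keeps an unassessed CVE from silently slipping to a P3 SLA.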

4. On-Call and Escalation

  • Primary on-call: [assign name/role]
  • Secondary escalation: [assign name/role]
  • Alert channels: Sentry notifications, team Slack channel
  • All P1/P2 incidents must have a named incident commander within 15 minutes of detection.

5. Related Documents

  • docs/ops/production-runbook.md — rollback and deploy procedures
  • docs/security/secrets-management.md — secret rotation playbooks
  • docs/security/access-control-policy.md — account deprovision steps
  • docs/security/backup-restore-and-dr.md — restore from backup
  • docs/security/vulnerability-management.md — CVE triage and patch SLAs