# Incident Response
Owner: Backend Lead / On-call
Last Edited: March 26, 2026
Last Reviewed: March 26, 2026
## 1. Severity Levels
| Severity | Definition | Response target | Example |
|---|---|---|---|
| P1 — Critical | Production down or confirmed data breach | Immediate (< 15 min to triage) | API returning 500 for all requests; confirmed cross-org data leak; secret confirmed compromised in prod |
| P2 — High | Significant degradation or credible security threat | < 1 hour to triage | Elevated 5xx rate; repeated auth failures from a single source; Dependabot critical vuln in prod dep |
| P3 — Medium | Limited impact; workaround available | < 4 hours | Single endpoint misbehaving; Slack alerts not routing; staging environment issue |
| P4 — Low | Informational or minor | Next business day | Dependency alert (non-critical); doc drift; non-urgent config cleanup |
## 2. Incident Lifecycle
Detection → Triage → Containment → Eradication → Recovery → Postmortem
### 2.1 Detection
Incidents may be detected via:
- Sentry alert (error spike, new error group, performance degradation)
- Render service metrics anomaly (CPU, memory, restart)
- External report (user, partner, security researcher)
- GitHub secret scanning alert
- Dependabot critical CVE alert
- Manual observation
### 2.2 Triage (first 15 min for P1)
- Acknowledge the alert (Sentry notification or team Slack channel).
- Determine severity using the table above.
- Check `/health` and `/ready` — if either is down, the outage is infrastructure-level.
- Check Render logs for the backend service.
- Check Sentry for the time window of the anomaly (error groups, breadcrumbs, performance traces).
- Identify: is this a production outage, a security incident, or both?
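The endpoint checks above can be scripted. A minimal sketch, assuming the service exposes `/health` and `/ready` and that `API_BASE` (a hypothetical variable) points at the production base URL:

```shell
#!/usr/bin/env sh
# Quick triage probe: classify the HTTP status of each health endpoint.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption: set to the real base URL

classify() {
  # 200 means the endpoint answered; anything else (including the "000"
  # curl emits when the host is unreachable) is treated as down.
  if [ "$1" = "200" ]; then echo "up"; else echo "down"; fi
}

for path in /health /ready; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$API_BASE$path")
  echo "$path: $(classify "$code") (HTTP $code)"
done
```

If both probes report down, treat the incident as infrastructure-level rather than an application bug.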
### 2.3 Containment
Production outage:
- If a bad deploy caused it: roll back immediately (see `docs/ops/production-runbook.md`, Section 8).
- If a dependency (DB, Redis) is down: notify the managed service provider and monitor.
Security incident (compromised secret, data exposure, auth bypass):
- Revoke the affected credential immediately. Rotate it in the upstream service and update the value in Infisical (which auto-syncs to Render/Cloudflare). See `docs/security/secrets-management.md`.
- If a user account is compromised: deprovision in Clerk immediately.
- If cross-org data was accessed: identify affected org IDs from audit logs.
- Do not delete logs — preserve evidence before any cleanup.
Active attack (repeated auth failures, rate limit abuse):
- Block the source IP at the Cloudflare/edge level.
- Do not block at the application level only — the edge block is faster and more reliable.
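An edge block can be applied via Cloudflare's IP Access Rules API. A hedged sketch — the `CF_API_TOKEN` and `CF_ZONE_ID` variables are assumptions, and the rule shape should be checked against the current Cloudflare API reference:

```shell
#!/usr/bin/env sh
# Block an attacking IP at the Cloudflare edge (zone-level IP Access Rule).
# CF_API_TOKEN and CF_ZONE_ID are placeholders -- substitute real values.

block_payload() {
  # Build the JSON body for a block rule on a single IP.
  printf '{"mode":"block","configuration":{"target":"ip","value":"%s"},"notes":"incident response block"}' "$1"
}

IP="${1:-}"
if [ -n "$IP" ] && [ -n "${CF_API_TOKEN:-}" ] && [ -n "${CF_ZONE_ID:-}" ]; then
  curl -s -X POST \
    "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/firewall/access_rules/rules" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "$(block_payload "$IP")"
else
  echo "usage: block-ip.sh <ip> (with CF_API_TOKEN and CF_ZONE_ID set)"
fi
```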
### 2.4 Eradication
- Patch or revert the root cause (bad deploy, vulnerable dep, misconfiguration).
- Rotate all secrets that may have been exposed.
- Remove malicious access (revoke API keys, deprovision compromised accounts).
### 2.5 Recovery
- Re-deploy the clean/patched version.
- Verify with `/health`, `/ready`, and the k6 smoke gate.
- Confirm monitoring and alerts are back to baseline.
- Notify affected parties if data was exposed.
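The verification step can be gated so the smoke test only runs once both endpoints pass. A sketch, assuming `API_BASE` and a `k6/smoke.js` script path (both hypothetical):

```shell
#!/usr/bin/env sh
# Post-recovery verification: both health endpoints must return 200
# before the k6 smoke gate runs.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption

endpoint_ok() {
  # Only HTTP 200 counts as a pass.
  [ "$1" = "200" ]
}

fail=0
for path in /health /ready; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$API_BASE$path")
  if endpoint_ok "$code"; then
    echo "$path OK"
  else
    echo "$path FAILED (HTTP $code)"
    fail=1
  fi
done

if [ "$fail" -eq 0 ] && command -v k6 >/dev/null 2>&1; then
  k6 run k6/smoke.js   # smoke gate; script path is an assumption
fi
```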
### 2.6 Postmortem
Every P1 and P2 incident requires a postmortem within 5 business days.
Postmortem template:

```markdown
## Incident: <title>

Date: YYYY-MM-DD
Severity: P1 / P2
Duration: <start> to <end>
Author: <name>

### What happened
<timeline of events>

### Root cause
<what actually caused it>

### Detection
<how was it found, and how long did detection take>

### Impact
<what was affected, how many users/orgs, any data exposure>

### Containment and recovery steps taken
<what was done and when>

### What went well
<things that worked>

### What to improve
<gaps in detection, response, or process>

### Action items
| Item | Owner | Due |
|---|---|---|
```
Store postmortems in `docs/audit/incidents/`.
## 3. Specific Playbooks

### 3.1 Compromised secret
- Identify which secret is compromised and in which environment.
- Rotate immediately using the playbook in `docs/security/secrets-management.md`.
- Audit access logs for the time window the secret was exposed.
- Identify any actions taken using the compromised credential.
- If customer data was potentially accessed: escalate to P1, notify affected orgs.
- Remove the secret from any git history or logs where it appeared.
- File postmortem.
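Removing a leaked value from git history can be done with the `git-filter-repo` tool's `--replace-text` mode. A sketch — `THE_LEAKED_VALUE` is a placeholder, and a history rewrite requires a force-push plus coordination with everyone holding clones:

```shell
#!/usr/bin/env sh
# Scrub a leaked secret value from git history using git-filter-repo.
# THE_LEAKED_VALUE is a placeholder for the actual exposed string.

# One replacement rule per line; "literal:" matches the exact string.
printf 'literal:THE_LEAKED_VALUE==>***REMOVED***\n' > expressions.txt

if command -v git-filter-repo >/dev/null 2>&1; then
  git filter-repo --replace-text expressions.txt
else
  echo "git-filter-repo not installed; see its documentation for setup"
fi
```

Note that rotating the secret is still mandatory: history rewriting does not un-expose a value that was already fetched.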
### 3.2 Confirmed cross-org data access
- Escalate to P1 immediately.
- Pull audit logs for the affected time window: `GET /api/v1/audit-logs` (requires the `audit:read` scope).
- Filter by the attacker's org ID and the victim's org ID.
- Identify exactly which records were accessed.
- Notify affected org(s).
- Determine root cause (code bug? compromised credential? config error?).
- Patch, deploy, verify.
- File postmortem.
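The audit-log pull can be sketched as a curl call. The query parameter names (`org_id`, `from`, `to`) and the `AUDIT_TOKEN` variable are assumptions — confirm them against the API reference before use:

```shell
#!/usr/bin/env sh
# Pull audit logs for an incident window. Parameter names are assumptions.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption

audit_url() {
  # Build the request URL for one org ID and a time window.
  printf '%s/api/v1/audit-logs?org_id=%s&from=%s&to=%s' "$API_BASE" "$1" "$2" "$3"
}

if [ -n "${AUDIT_TOKEN:-}" ]; then   # token must carry the audit:read scope
  curl -s -H "Authorization: Bearer $AUDIT_TOKEN" \
    "$(audit_url org_attacker 2026-03-26T00:00:00Z 2026-03-26T06:00:00Z)"
else
  echo "AUDIT_TOKEN not set; skipping fetch"
fi
```

Run it once per org ID involved (attacker's and victim's) and archive the raw responses with the incident record.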
### 3.3 Production outage (API down)
- Check container status: `docker compose ps`.
- Check logs: `docker compose logs --tail=200 api`.
- Check the DB: can you reach `/ready`? If not, the DB or Redis may be down.
- If a deploy caused it: roll back (see `docs/ops/production-runbook.md`).
- If infrastructure: contact the managed service provider.
- Update status in Slack throughout.
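The checks above can be run as a single pass. A sketch, assuming the compose service is named `api` and `API_BASE` (hypothetical) points at production:

```shell
#!/usr/bin/env sh
# One-pass outage triage: container state, recent logs, readiness probe.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption

diagnose() {
  # /ready failing while containers are up usually points at DB/Redis;
  # /ready passing suggests an application or deploy issue instead.
  if [ "$1" = "200" ]; then echo "app-or-deploy"; else echo "db-or-redis"; fi
}

if command -v docker >/dev/null 2>&1; then
  docker compose ps
  docker compose logs --tail=200 api
fi

code=$(curl -s -o /dev/null -w '%{http_code}' "$API_BASE/ready")
echo "suspected layer: $(diagnose "$code")"
```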
### 3.4 Elevated 5xx rate (no full outage)
- Check logs for the error pattern (Render logs or Sentry breadcrumbs).
- Check Sentry — is it one endpoint or all?
- Check if a recent deploy correlates.
- If a deploy correlates: roll back.
- If no deploy: investigate the specific error in logs (DB timeout? validation panic? upstream dep?).
### 3.5 Dependabot critical CVE
- Review the CVE and assess whether the vulnerable code path is reachable in production.
- If reachable: treat as P2 and patch within 24 hours.
- If not reachable: treat as P3, patch within the next sprint.
- Track in the vulnerability management log (`docs/security/vulnerability-management.md`).
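The reachability assessment can be partly automated. Assuming a Go backend (an assumption; adjust the tool for your stack), `govulncheck` reports only vulnerabilities whose affected functions are actually called from your code:

```shell
#!/usr/bin/env sh
# Assess whether a reported CVE's vulnerable code path is reachable,
# then apply this runbook's severity policy to the result.

severity_for() {
  # Policy from this runbook: reachable -> P2 (patch within 24 hours),
  # not reachable -> P3 (patch within the next sprint).
  if [ "$1" = "reachable" ]; then echo "P2"; else echo "P3"; fi
}

if command -v govulncheck >/dev/null 2>&1; then
  govulncheck ./...   # run from the repo root
else
  echo "govulncheck not installed: go install golang.org/x/vuln/cmd/govulncheck@latest"
fi
```

A clean `govulncheck` result is evidence for the P3 path, but record the reasoning in the vulnerability management log either way.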
## 4. On-Call and Escalation
- Primary on-call: [assign name/role]
- Secondary escalation: [assign name/role]
- Alert channels: Sentry notifications, team Slack channel
- All P1/P2 incidents must have a named incident commander within 15 minutes of detection.
## 5. Related Docs
- `docs/ops/production-runbook.md` — rollback and deploy procedures
- `docs/security/secrets-management.md` — secret rotation playbooks
- `docs/security/access-control-policy.md` — account deprovision steps
- `docs/security/backup-restore-and-dr.md` — restore from backup
- `docs/security/vulnerability-management.md` — CVE triage and patch SLAs