# Incident Response
Owner: Backend Lead / On-call
Last Edited: March 26, 2026
Last Reviewed: March 26, 2026
## 1. Severity Levels
| Severity | Definition | Response target | Example |
|---|---|---|---|
| P1 — Critical | Production down or confirmed data breach | Immediate (< 15 min to triage) | API returning 500 for all requests; confirmed cross-org data leak; secret confirmed compromised in prod |
| P2 — High | Significant degradation or credible security threat | < 1 hour to triage | Elevated 5xx rate; repeated auth failures from a single source; Dependabot critical vuln in prod dep |
| P3 — Medium | Limited impact; workaround available | < 4 hours | Single endpoint misbehaving; Slack alerts not routing; staging environment issue |
| P4 — Low | Informational or minor | Next business day | Dependency alert (non-critical); doc drift; non-urgent config cleanup |
## 2. Incident Lifecycle
Detection → Triage → Containment → Eradication → Recovery → Postmortem
### 2.1 Detection
Incidents may be detected via:
- Sentry alert (error spike, new error group, performance degradation)
- Render service metrics anomaly (CPU, memory, restart)
- External report (user, partner, security researcher)
- GitHub secret scanning alert
- Dependabot critical CVE alert
- Manual observation
### 2.2 Triage (first 15 min for P1)
- Acknowledge the alert (Sentry notification or team Slack channel).
- Determine severity using the table above.
- Check `/health` and `/ready` — if either is down, the outage is infrastructure-level.
- Check Render logs for the backend service.
- Check Sentry for the time window of the anomaly (error groups, breadcrumbs, performance traces).
- Identify: is this a production outage, a security incident, or both?
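The endpoint checks above can be scripted. A minimal sketch, assuming the service exposes `/health` and `/ready` and that `API_BASE` (a hypothetical variable) points at the production base URL:

```shell
#!/usr/bin/env sh
# Quick triage probe: classify the HTTP status of each health endpoint.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption: set to the real base URL

classify() {
  # 200 means the endpoint answered; anything else (including the "000"
  # curl emits when the host is unreachable) is treated as down.
  if [ "$1" = "200" ]; then echo "up"; else echo "down"; fi
}

for path in /health /ready; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$API_BASE$path")
  echo "$path: $(classify "$code") (HTTP $code)"
done
```

If both probes report down, treat the incident as infrastructure-level rather than an application bug.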
### 2.3 Containment
Production outage:
- If a bad deploy caused it: roll back immediately (see `docs/ops/production-runbook.md`, Section 8).
- If a dependency (DB, Redis) is down: notify the managed service provider and monitor.
Security incident (compromised secret, data exposure, auth bypass):
- Revoke the affected credential immediately. Rotate it in the upstream service and update the value in Infisical (which auto-syncs to Render/Cloudflare). See `docs/security/secrets-management.md`.
- If a user account is compromised: deprovision in Clerk immediately.
- If cross-org data was accessed: identify affected org IDs from audit logs.
- Do not delete logs — preserve evidence before any cleanup.
Active attack (repeated auth failures, rate limit abuse):
- Block the source IP at the Cloudflare/edge level.
- Do not block at the application level only — the edge block is faster and more reliable.
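An edge block can be applied via Cloudflare's IP Access Rules API. A hedged sketch — the `CF_API_TOKEN` and `CF_ZONE_ID` variables are assumptions, and the rule shape should be checked against the current Cloudflare API reference:

```shell
#!/usr/bin/env sh
# Block an attacking IP at the Cloudflare edge (zone-level IP Access Rule).
# CF_API_TOKEN and CF_ZONE_ID are placeholders -- substitute real values.

block_payload() {
  # Build the JSON body for a block rule on a single IP.
  printf '{"mode":"block","configuration":{"target":"ip","value":"%s"},"notes":"incident response block"}' "$1"
}

IP="${1:-}"
if [ -n "$IP" ] && [ -n "${CF_API_TOKEN:-}" ] && [ -n "${CF_ZONE_ID:-}" ]; then
  curl -s -X POST \
    "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/firewall/access_rules/rules" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "$(block_payload "$IP")"
else
  echo "usage: block-ip.sh <ip> (with CF_API_TOKEN and CF_ZONE_ID set)"
fi
```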
### 2.4 Eradication
- Patch or revert the root cause (bad deploy, vulnerable dep, misconfiguration).
- Rotate all secrets that may have been exposed.
- Remove malicious access (revoke API keys, deprovision compromised accounts).
### 2.5 Recovery
- Re-deploy the clean/patched version.
- Verify with `/health`, `/ready`, and the k6 smoke gate.
- Confirm monitoring and alerts are back to baseline.
- Notify affected parties if data was exposed.
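The verification step can be gated so the smoke test only runs once both endpoints pass. A sketch, assuming `API_BASE` and a `k6/smoke.js` script path (both hypothetical):

```shell
#!/usr/bin/env sh
# Post-recovery verification: both health endpoints must return 200
# before the k6 smoke gate runs.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption

endpoint_ok() {
  # Only HTTP 200 counts as a pass.
  [ "$1" = "200" ]
}

fail=0
for path in /health /ready; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$API_BASE$path")
  if endpoint_ok "$code"; then
    echo "$path OK"
  else
    echo "$path FAILED (HTTP $code)"
    fail=1
  fi
done

if [ "$fail" -eq 0 ] && command -v k6 >/dev/null 2>&1; then
  k6 run k6/smoke.js   # smoke gate; script path is an assumption
fi
```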
### 2.6 Postmortem
Every P1 and P2 incident requires a postmortem within 5 business days.
Postmortem template:

```markdown
## Incident: <title>

Date: YYYY-MM-DD
Severity: P1 / P2
Duration: <start> to <end>
Author: <name>

### What happened
<timeline of events>

### Root cause
<what actually caused it>

### Detection
<how was it found, and how long did detection take>

### Impact
<what was affected, how many users/orgs, any data exposure>

### Containment and recovery steps taken
<what was done and when>

### What went well
<things that worked>

### What to improve
<gaps in detection, response, or process>

### Action items
| Item | Owner | Due |
|---|---|---|
```
Store postmortems in `docs/audit/incidents/`.
## 3. Specific Playbooks

### 3.1 Compromised secret
- Identify which secret is compromised and in which environment.
- Rotate immediately using the playbook in `docs/security/secrets-management.md`.
- Audit access logs for the time window the secret was exposed.
- Identify any actions taken using the compromised credential.
- If customer data was potentially accessed: escalate to P1, notify affected orgs.
- Remove the secret from any git history or logs where it appeared.
- File postmortem.
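Removing a leaked value from git history can be done with the `git-filter-repo` tool's `--replace-text` mode. A sketch — `THE_LEAKED_VALUE` is a placeholder, and a history rewrite requires a force-push plus coordination with everyone holding clones:

```shell
#!/usr/bin/env sh
# Scrub a leaked secret value from git history using git-filter-repo.
# THE_LEAKED_VALUE is a placeholder for the actual exposed string.

# One replacement rule per line; "literal:" matches the exact string.
printf 'literal:THE_LEAKED_VALUE==>***REMOVED***\n' > expressions.txt

if command -v git-filter-repo >/dev/null 2>&1; then
  git filter-repo --replace-text expressions.txt
else
  echo "git-filter-repo not installed; see its documentation for setup"
fi
```

Note that rotating the secret is still mandatory: history rewriting does not un-expose a value that was already fetched.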
### 3.2 Confirmed cross-org data access
- Escalate to P1 immediately.
- Pull audit logs for the affected time window: `GET /api/v1/audit-logs` (requires the `audit:read` scope).
- Filter by the attacker's org ID and the victim's org ID.
- Identify exactly which records were accessed.
- Notify affected org(s).
- Determine root cause (code bug? compromised credential? config error?).
- Patch, deploy, verify.
- File postmortem.
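The audit-log pull can be sketched as a curl call. The query parameter names (`org_id`, `from`, `to`) and the `AUDIT_TOKEN` variable are assumptions — confirm them against the API reference before use:

```shell
#!/usr/bin/env sh
# Pull audit logs for an incident window. Parameter names are assumptions.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption

audit_url() {
  # Build the request URL for one org ID and a time window.
  printf '%s/api/v1/audit-logs?org_id=%s&from=%s&to=%s' "$API_BASE" "$1" "$2" "$3"
}

if [ -n "${AUDIT_TOKEN:-}" ]; then   # token must carry the audit:read scope
  curl -s -H "Authorization: Bearer $AUDIT_TOKEN" \
    "$(audit_url org_attacker 2026-03-26T00:00:00Z 2026-03-26T06:00:00Z)"
else
  echo "AUDIT_TOKEN not set; skipping fetch"
fi
```

Run it once per org ID involved (attacker's and victim's) and archive the raw responses with the incident record.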
### 3.3 Production outage (API down)
- Check container status: `docker compose ps`.
- Check logs: `docker compose logs --tail=200 api`.
- Check the DB: can you reach `/ready`? If not, the DB or Redis may be down.
- If a deploy caused it: roll back (see `docs/ops/production-runbook.md`).
- If infrastructure: contact the managed service provider.
- Update status in Slack throughout.
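The checks above can be run as a single pass. A sketch, assuming the compose service is named `api` and `API_BASE` (hypothetical) points at production:

```shell
#!/usr/bin/env sh
# One-pass outage triage: container state, recent logs, readiness probe.
API_BASE="${API_BASE:-https://api.example.com}"  # assumption

diagnose() {
  # /ready failing while containers are up usually points at DB/Redis;
  # /ready passing suggests an application or deploy issue instead.
  if [ "$1" = "200" ]; then echo "app-or-deploy"; else echo "db-or-redis"; fi
}

if command -v docker >/dev/null 2>&1; then
  docker compose ps
  docker compose logs --tail=200 api
fi

code=$(curl -s -o /dev/null -w '%{http_code}' "$API_BASE/ready")
echo "suspected layer: $(diagnose "$code")"
```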
### 3.4 Elevated 5xx rate (no full outage)
- Check logs for the error pattern (Render logs or Sentry breadcrumbs).
- Check Sentry — is it one endpoint or all?
- Check if a recent deploy correlates.
- If a deploy correlates: roll back.
- If no deploy: investigate the specific error in logs (DB timeout? validation panic? upstream dep?).
### 3.5 Dependabot critical CVE
- Review the CVE and assess whether the vulnerable code path is reachable in production.
- If reachable: treat as P2 and patch within 24 hours.
- If not reachable: treat as P3, patch within the next sprint.
- Track in the vulnerability management log (`docs/security/vulnerability-management.md`).
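The reachability assessment can be partly automated. Assuming a Go backend (an assumption; adjust the tool for your stack), `govulncheck` reports only vulnerabilities whose affected functions are actually called from your code:

```shell
#!/usr/bin/env sh
# Assess whether a reported CVE's vulnerable code path is reachable,
# then apply this runbook's severity policy to the result.

severity_for() {
  # Policy from this runbook: reachable -> P2 (patch within 24 hours),
  # not reachable -> P3 (patch within the next sprint).
  if [ "$1" = "reachable" ]; then echo "P2"; else echo "P3"; fi
}

if command -v govulncheck >/dev/null 2>&1; then
  govulncheck ./...   # run from the repo root
else
  echo "govulncheck not installed: go install golang.org/x/vuln/cmd/govulncheck@latest"
fi
```

A clean `govulncheck` result is evidence for the P3 path, but record the reasoning in the vulnerability management log either way.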
## 4. On-Call and Escalation
- Primary on-call: [assign name/role]
- Secondary escalation: [assign name/role]
- Alert channels: Sentry notifications, team Slack channel
- All P1/P2 incidents must have a named incident commander within 15 minutes of detection.
## 5. Related Docs
- `docs/ops/production-runbook.md` — rollback and deploy procedures
- `docs/security/secrets-management.md` — secret rotation playbooks
- `docs/security/access-control-policy.md` — account deprovision steps
- `docs/security/backup-restore-and-dr.md` — restore from backup
- `docs/security/vulnerability-management.md` — CVE triage and patch SLAs