Incident Response #
This runbook provides a structured approach to handling production incidents, from initial detection through resolution and post-incident review.
When to Use This Runbook #
- Production service outage or degradation
- Security incidents or suspected breaches
- Data integrity issues
- Customer-impacting bugs
- System performance degradation
- Third-party service failures affecting BonsAI
- PagerDuty alert received
Incident Severity Levels #
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| 1 | Critical - Complete service outage | Immediate | API down, database unavailable, security breach |
| 2 | High - Major functionality impaired | 15 minutes | Slow response times, login failures, data processing stopped |
| 3 | Medium - Partial functionality affected | 1 hour | Single feature broken, non-critical errors |
| 4 | Low - Minor issues, workaround available | Next business day | UI glitches, non-urgent bugs |
Incident Response Phases #
Phase 1: Detection & Triage (0-5 minutes) #
Goal: Identify and assess the incident quickly
Step 1: Alert Received #
Incidents may be detected through:
- PagerDuty - Primary incident alerting system
- Monitoring alerts (Datadog, CloudWatch)
- Customer reports (support tickets, Slack)
- Error tracking (Sentry)
- Team member observation
PagerDuty Integration:
When PagerDuty triggers an alert:
- You’ll receive notification via:
  - Phone call (for SEV1)
  - SMS
  - Push notification
- Acknowledge the incident in PagerDuty immediately (see the API sketch below)
- Create an incident thread in Slack #eng-incident-prd
- Link the PagerDuty incident to the Slack thread
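The acknowledgement can also be scripted against PagerDuty's REST API if you are already in a terminal. A minimal sketch, assuming an API token in PAGERDUTY_API_TOKEN, the incident ID in INCIDENT_ID (both placeholder names), and your PagerDuty login email in the From header:
# Acknowledge a PagerDuty incident via the REST API (token, ID, and email are placeholders)
curl -s -X PUT "https://api.pagerduty.com/incidents/$INCIDENT_ID" \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "From: your-name@example.com" \
  -d '{"incident": {"type": "incident_reference", "status": "acknowledged"}}'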
Step 2: Initial Assessment #
Quick questions to answer:
1. What is broken? (service, feature, user flow)
2. How many users are affected? (all, some, specific segment)
3. Is this a new issue or recurring?
4. When did it start? (use monitoring dashboards)
5. What changed recently? (deployments, config changes)
Gather initial evidence:
# Check service health
kubectl get pods
kubectl get pods --all-namespaces | grep -v Running
# Check recent deployments
kubectl rollout history deployment/bonsapi-deployment
# Check recent events
kubectl get events --sort-by='.lastTimestamp' | head -20
# Check error rates in Datadog
# Check Sentry for exception spikes
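For the Datadog check, the metrics query API can confirm an error spike without leaving the terminal. A rough sketch only, assuming DD_API_KEY/DD_APP_KEY are set and that an APM error-count metric and a service:bonsapi tag exist in your account (the metric name and tag are assumptions):
# Query the last 15 minutes of an error-count metric (metric name and tag are assumptions)
now=$(date +%s)
curl -s -G "https://api.datadoghq.com/api/v1/query" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  --data-urlencode "from=$((now - 900))" \
  --data-urlencode "to=$now" \
  --data-urlencode "query=sum:trace.http.request.errors{service:bonsapi}.as_count()"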
Step 3: Determine Severity #
Use the severity levels table above to classify the incident.
Severity Decision Tree:
Is production completely down? → SEV1
Is a critical user flow broken? → SEV2
Is a single feature affected with workaround? → SEV3
Is it a minor cosmetic issue? → SEV4
Phase 2: Response & Mitigation (5-30 minutes) #
Goal: Stop the bleeding, restore service
Step 1: Communicate #
For SEV1/SEV2 incidents:
- Acknowledge in PagerDuty
  - Open the PagerDuty incident
  - Click “Acknowledge” to stop alerts
  - Set yourself as incident commander
- Post in #eng-incident-prd channel
  Incident thread naming: #incident-YYYYMMDD-brief-description
  Example: #incident-20251021-api-down
- Post initial status (a webhook sketch follows this list)
  🚨 INCIDENT DETECTED
  Severity: 2
  Issue: API returning 502 errors
  Impact: ~50% of API requests failing
  Started: 10:15 UTC
  Status: Investigating
  Incident Commander: @your-name
  PagerDuty: https://tofu-bonsai.pagerduty.com/incidents/[ID]
- Notify stakeholders
  - Engineering team (auto-notified via PagerDuty)
  - Customer success (if customer-facing)
  - Management (for SEV1)
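If #eng-incident-prd has an incoming webhook configured, the initial status can also be posted from the command line. A minimal sketch, assuming the webhook URL is stored in SLACK_WEBHOOK_URL (a placeholder):
# Post the initial status to Slack via an incoming webhook (webhook URL is a placeholder)
curl -s -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-type: application/json' \
  -d '{"text": "🚨 INCIDENT DETECTED\nSeverity: 2\nIssue: API returning 502 errors\nImpact: ~50% of API requests failing\nStatus: Investigating\nIncident Commander: @your-name"}'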
For SEV3/SEV4 incidents:
- Post in #eng-incident-prd
- Create ticket in Linear
- No need for dedicated incident channel
Step 2: Assign Roles #
For major incidents (SEV1/SEV2):
- Incident Commander (IC) - Coordinates response, makes decisions
- Investigator(s) - Debug and identify root cause
- Communicator - Update stakeholders, customers
- Scribe - Document actions taken
Step 3: Investigate #
Use the appropriate runbook:
| Symptom | Runbook |
|---|---|
| API errors/downtime | Monitoring & Alerting |
| Pods crashing | Kubernetes Debugging |
| Database issues | Database Access |
| Queue backlog | RabbitMQ Management |
| Deployment issues | Deployment Monitoring |
Investigation checklist:
# 1. Check service status
kubectl get pods -o wide
kubectl get services
kubectl get ingress
# 2. Check pod logs
kubectl logs -l app=bonsapi --tail=100
kubectl logs -l app=bonsapi --tail=100 | grep -i error
# 3. Check events
kubectl get events --sort-by='.lastTimestamp' | head -30
# 4. Check resource usage
kubectl top nodes
kubectl top pods
# 5. Check external dependencies
# - Database connectivity
# - Redis connectivity
# - RabbitMQ status
# - Third-party APIs (Clerk, Stripe, etc.)
# 6. Review recent changes
git log --oneline --since="2 hours ago"
# Check GitHub Actions for recent deployments
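The external-dependency checks in step 5 can usually be run from inside a pod. A rough sketch, assuming nc is present in the container image; host names and the RabbitMQ pod name are placeholders:
# Database (PostgreSQL) reachability from an API pod
kubectl exec -it <pod-name> -- nc -zv <db-host> 5432
# Redis reachability
kubectl exec -it <pod-name> -- nc -zv <redis-host> 6379
# RabbitMQ node health and queue depths (run inside the RabbitMQ pod)
kubectl exec -it <rabbitmq-pod> -- rabbitmqctl status
kubectl exec -it <rabbitmq-pod> -- rabbitmqctl list_queues name messages
# Third-party reachability (expect an HTTP status code back)
curl -s -o /dev/null -w '%{http_code}\n' https://api.stripe.com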
Step 4: Implement Mitigation #
Mitigation strategies (in order of preference):
- Quick fix (if the root cause is known and simple)
  # Example: fix a config issue
  kubectl edit configmap app-config
  kubectl rollout restart deployment/bonsapi-deployment
- Rollback (if caused by a recent deployment)
  kubectl rollout undo deployment/bonsapi-deployment
  kubectl rollout status deployment/bonsapi-deployment
- Scale up (if resource exhaustion)
  kubectl scale deployment/bonsapi-deployment --replicas=5
- Disable feature (if a feature is causing issues; see the flag sketch after this list)
  # Use a feature flag to disable the problematic feature
  # Update in Doppler or config
- Failover (if infrastructure issue)
  - Route traffic to backup region (if available)
  - Switch to backup database
  - Use cached data
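For the feature-flag option, a ConfigMap-backed flag can be flipped without shipping code. A minimal sketch, assuming the flag lives in the app-config ConfigMap under a key such as ENABLE_DOC_PROCESSING (the key name is an assumption; if the flag lives in Doppler, update the secret there instead):
# Disable a problematic feature by patching its ConfigMap flag (key name is an assumption)
kubectl patch configmap app-config --type merge -p '{"data":{"ENABLE_DOC_PROCESSING":"false"}}'
# Restart pods so they pick up the new value
kubectl rollout restart deployment/bonsapi-deployment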
Step 5: Verify Mitigation #
After implementing mitigation:
# 1. Check service health
kubectl get pods
kubectl logs <pod-name> --tail=50
# 2. Test affected functionality
curl https://api.gotofu.com/health
# Test specific endpoints that were broken
# 3. Monitor metrics
# - Error rate should drop to normal
# - Response time should improve
# - Request volume should recover
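To watch recovery without sitting on a dashboard, a short polling loop against the health endpoint gives a quick read on whether errors have stopped (a convenience sketch only):
# Poll the health endpoint every 5 seconds for two minutes and print the HTTP status
for i in $(seq 1 24); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://api.gotofu.com/health)
  echo "$(date -u +%H:%M:%S) HTTP $code"
  sleep 5
done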
Update stakeholders:
✅ MITIGATION APPLIED
Action: Rolled back to previous version
Status: Service restored
Impact: API errors stopped, service operational
Time to mitigate: 18 minutes
Next steps: Root cause analysis
Update PagerDuty:
- Add note to PagerDuty incident with mitigation details
- Keep incident open until fully resolved
- Update incident status to “Acknowledged” (not “Resolved” yet)
Phase 3: Resolution & Recovery (30+ minutes) #
Goal: Fully resolve the issue and restore normal operations
Step 1: Confirm Full Recovery #
- All metrics back to baseline
- Error rates normal
- No customer complaints
- All dependent systems operational
Step 2: Identify Root Cause #
5 Whys Analysis:
Example:
Problem: API returned 502 errors
Why? → BonsAPI pods were not responding
Why? → Pods were out of memory (OOMKilled)
Why? → Memory leak in new code
Why? → Missing memory cleanup in document processing
Why? → Code review didn't catch the leak
Root Cause: Insufficient testing and code review for memory management
Step 3: Implement Permanent Fix #
- Create fix branch
  git checkout -b fix/incident-api-oom
- Implement fix
  - Fix the root cause
  - Add tests to prevent regression
  - Update documentation
- Test thoroughly
  mise run test
  mise run ci
- Deploy fix
  - Follow standard deployment process
  - Monitor closely during rollout
Step 4: Communicate Resolution #
🎉 INCIDENT RESOLVED
Issue: Memory leak in document processing
Root Cause: Missing cleanup in new feature
Fix: Deployed patch v2.1.3
Status: Fully resolved
Duration: 45 minutes
Customer impact: 50% of API requests affected
Follow-up: Post-incident review scheduled
Close PagerDuty Incident:
- Add final resolution note in PagerDuty
- Mark incident as “Resolved”
- Link to post-incident review document
- Ensure incident timeline is accurate in PagerDuty
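The resolution note and status change can also be made through the PagerDuty REST API. A minimal sketch, reusing the placeholder token, incident ID, and From email from the acknowledgement example above:
# Add a resolution note, then mark the incident resolved (token, ID, and email are placeholders)
curl -s -X POST "https://api.pagerduty.com/incidents/$INCIDENT_ID/notes" \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "From: your-name@example.com" \
  -d '{"note": {"content": "Memory leak fixed in v2.1.3; service fully restored."}}'
curl -s -X PUT "https://api.pagerduty.com/incidents/$INCIDENT_ID" \
  -H "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "From: your-name@example.com" \
  -d '{"incident": {"type": "incident_reference", "status": "resolved"}}'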
Phase 4: Post-Incident Review (Within 48 hours) #
Goal: Learn from the incident and prevent recurrence
Step 1: Document Timeline #
Create incident report with:
- Incident summary
  - What happened?
  - When did it happen?
  - Who was affected?
  - How was it detected?
- Timeline of events
  10:15 UTC - Alert fired: High error rate
  10:17 UTC - Investigation started
  10:22 UTC - Root cause identified: OOM in pods
  10:25 UTC - Mitigation: Rollback initiated
  10:33 UTC - Service restored
  10:45 UTC - Permanent fix deployed
  11:00 UTC - Incident closed
- Impact assessment
  - Users affected: ~1,000 active users
  - Duration: 45 minutes
  - Revenue impact: Estimated $X
  - API requests failed: ~50,000
- Root cause analysis
  - Technical cause
  - Contributing factors
  - Why it wasn’t caught earlier
Step 2: Identify Action Items #
Categories:
- Immediate fixes (done during incident)
- Short-term improvements (within 1 week)
- Long-term improvements (within 1 month)
Example action items:
✅ COMPLETED
- Rolled back deployment
- Deployed memory leak fix
- Added monitoring for memory usage
🔄 IN PROGRESS (Week 1)
- Add memory profiling to CI pipeline
- Update code review checklist for memory management
- Improve alerting thresholds
📋 PLANNED (Month 1)
- Implement automatic rollback on OOM
- Add chaos engineering tests
- Improve runbook documentation
Step 3: Conduct Blameless Post-Mortem #
Meeting agenda:
- Review incident timeline (10 min)
- Discuss what went well (10 min)
- Fast detection
- Quick rollback
- Good communication
- Discuss what could improve (20 min)
- Earlier detection
- Better testing
- Improved monitoring
- Review action items (10 min)
- Assign owners and deadlines (10 min)
Key principles:
- Blameless - Focus on systems, not individuals
- Learning-focused - What can we learn?
- Action-oriented - Concrete improvements
- Documented - Share learnings with team
Common Incident Scenarios #
Scenario 1: Complete Service Outage #
Symptoms:
- All API requests failing
- Health checks failing
- Pods not responding
Quick Actions:
# 1. Check pod status
kubectl get pods -o wide
# 2. Check recent deployments
kubectl rollout history deployment/bonsapi-deployment
# 3. Rollback immediately
kubectl rollout undo deployment/bonsapi-deployment
# 4. Monitor recovery
kubectl rollout status deployment/bonsapi-deployment
Scenario 2: Database Connection Issues #
Symptoms:
- Database connection errors in logs
- Timeouts on database queries
- 500 errors from API
Quick Actions:
# 1. Check database status
aws rds describe-db-instances --db-instance-identifier bonsai-prod-db
# 2. Test connectivity from pod
kubectl exec -it <pod-name> -- nc -zv <db-host> 5432
# 3. Check connection pool
kubectl logs -l app=bonsapi | grep "connection pool"
# 4. Scale down if overwhelming DB
kubectl scale deployment/bonsapi-deployment --replicas=2
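If psql is available in the pod image, pg_stat_activity shows whether connections are piling up against the database's limit. A sketch, assuming the pod exposes a DATABASE_URL environment variable (both the psql binary and the variable name are assumptions):
# Count connections by state to spot pool exhaustion (psql and DATABASE_URL are assumptions)
kubectl exec -it <pod-name> -- sh -c 'psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"'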
Scenario 3: Third-Party Service Failure #
Symptoms:
- Errors calling external API (Clerk, Stripe, etc.)
- Timeouts on authentication/payments
- Partial functionality broken
Quick Actions:
- Verify third-party status (see the check sketch after this list)
  - Check status page (status.clerk.com, status.stripe.com)
  - Test API directly
- Enable graceful degradation
  - Use cached data if available
  - Queue requests for retry
  - Show user-friendly error message
- Communicate with users
  - Status page update
  - In-app notification
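Both parts of the status check can be scripted. A rough sketch; it assumes the vendor status pages are hosted on Atlassian Statuspage (which exposes /api/v2/status.json) and that a Stripe secret key is available in STRIPE_SECRET_KEY:
# Statuspage-hosted pages expose a JSON summary (assumes the vendor uses Statuspage)
curl -s https://status.clerk.com/api/v2/status.json
# Hit the vendor API directly; 200 with a valid key (401 without) means the API itself is reachable
curl -s -o /dev/null -w '%{http_code}\n' -u "$STRIPE_SECRET_KEY:" https://api.stripe.com/v1/charges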
Scenario 4: Security Incident #
Symptoms:
- Suspicious activity in logs
- Unexpected data access
- Security alert from monitoring
CRITICAL ACTIONS:
- Contain the threat IMMEDIATELY
  # Isolate affected systems
  kubectl scale deployment/<affected-deployment> --replicas=0
  # Rotate potentially compromised credentials
  # See Secrets Management runbook
- Preserve evidence
  - Capture logs
  - Take snapshots
  - Document findings
- Escalate to security team
  - DO NOT discuss in public channels
  - Use secure communication
- Follow security incident protocol
  - Assess scope of breach
  - Notify affected parties (if required)
  - Comply with legal/regulatory requirements
Incident Communication #
Internal Communication #
Status updates every 15-30 minutes:
⏱️ STATUS UPDATE (10:30 UTC)
Current status: Still investigating
Progress: Identified high memory usage in API pods
Next steps: Implementing memory limit increase as temporary fix
ETA: 10 minutes
Customer Communication #
For customer-facing incidents:
- Initial acknowledgment (< 15 minutes)
  We're aware of an issue affecting API response times. Our team is investigating and will provide updates soon.
- Status updates (every 30-60 minutes)
  Update: We've identified the cause and are implementing a fix.
  Impact: ~50% of API requests experiencing delays.
  ETA: Service restoration expected within 30 minutes.
- Resolution notice
  Resolved: The API issue has been fixed and service is fully restored. We apologize for the inconvenience and are taking steps to prevent recurrence.
Incident Tools & Resources #
Quick Access Links #
- GitHub Repository: https://github.com/tofu2-limited/bonsai
- GitHub Actions: https://github.com/tofu2-limited/bonsai/actions
- Datadog Dashboard: https://app.datadoghq.com
- AWS Console: https://console.aws.amazon.com
- Linear: https://linear.app
Useful Commands #
# Quick health check script
kubectl get pods -o wide && \
kubectl get deployments && \
kubectl get services && \
kubectl top nodes && \
kubectl top pods
# Tail logs from all services
kubectl logs -f -l app=bonsapi &
kubectl logs -f -l app=webapp &
kubectl logs -f -l app=bonsai-invoice &
# Watch events
watch kubectl get events --sort-by='.lastTimestamp'
Escalation Path #
When to Escalate #
- Incident severity is SEV1
- Mitigation attempts have failed
- Issue requires specialized expertise
- Legal or compliance concerns
- Security incident
Escalation Contacts #
- Team Lead / Engineering Manager
  - Technical decisions
  - Resource allocation
  - Business impact assessment
- DevOps / Infrastructure Team
  - Infrastructure issues
  - AWS/Kubernetes expertise
  - Database administration
- Security Team
  - Security incidents
  - Data breaches
  - Compliance issues
- Customer Success / Support
  - Customer communication
  - Impact assessment
  - Workarounds
Best Practices #
During Incidents #
- Stay calm - Clear thinking is crucial
- Communicate clearly - Avoid jargon, be specific
- Document everything - Actions, findings, decisions
- Focus on mitigation first - Root cause analysis comes later
- Don’t guess - Verify before implementing changes
- Ask for help - Escalate early if stuck
Incident Prevention #
- Monitor proactively - Don’t wait for customers to report
- Test thoroughly - Catch issues before production
- Deploy carefully - Use canary/blue-green deployments
- Review regularly - Learn from past incidents
- Practice - Run incident response drills
Post-Incident #
- Close the loop - Complete all action items
- Share learnings - Educate the team
- Update runbooks - Improve documentation
- Recognize effort - Thank the responders
See Also #
- Monitoring & Alerting - Detection and alerting
- Kubernetes Debugging - Pod troubleshooting
- Database Access - Database incidents
- RabbitMQ Management - Queue incidents
- Deployment Monitoring - Deployment issues
- Secrets Management - Credential rotation