Incident Response #

This runbook provides a structured approach to handling production incidents, from initial detection through resolution and post-incident review.

When to Use This Runbook #

  • Production service outage or degradation
  • Security incidents or suspected breaches
  • Data integrity issues
  • Customer-impacting bugs
  • System performance degradation
  • Third-party service failures affecting BonsAI
  • PagerDuty alert received

Incident Severity Levels #

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV1 | Critical - Complete service outage | Immediate | API down, database unavailable, security breach |
| SEV2 | High - Major functionality impaired | 15 minutes | Slow response times, login failures, data processing stopped |
| SEV3 | Medium - Partial functionality affected | 1 hour | Single feature broken, non-critical errors |
| SEV4 | Low - Minor issues, workaround available | Next business day | UI glitches, non-urgent bugs |

Incident Response Phases #

Phase 1: Detection & Triage (0-5 minutes) #

Goal: Identify and assess the incident quickly

Step 1: Alert Received #

Incidents may be detected through:

  • PagerDuty - Primary incident alerting system
  • Monitoring alerts (Datadog, CloudWatch)
  • Customer reports (support tickets, Slack)
  • Error tracking (Sentry)
  • Team member observation

PagerDuty Integration:

When PagerDuty triggers an alert:

  1. You’ll receive notification via:
    • Phone call (for SEV1)
    • SMS
    • Push notification
    • Email
  2. Acknowledge the incident in PagerDuty immediately
  3. Create incident thread in Slack #eng-incident-prd
  4. Link the PagerDuty incident to the Slack thread

Step 2: Initial Assessment #

Quick questions to answer:
1. What is broken? (service, feature, user flow)
2. How many users are affected? (all, some, specific segment)
3. Is this a new issue or recurring?
4. When did it start? (use monitoring dashboards)
5. What changed recently? (deployments, config changes)

Gather initial evidence:

# Check service health
kubectl get pods
kubectl get pods --all-namespaces | grep -v Running

# Check recent deployments
kubectl rollout history deployment/bonsapi-deployment

# Check recent events
kubectl get events --sort-by='.lastTimestamp' | head -20

# Check error rates in Datadog
# Check Sentry for exception spikes
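# Optional: pull an error-rate series from Datadog's metrics query API
# (assumes DD_API_KEY / DD_APP_KEY are exported and that a metric like the
# hypothetical "bonsapi.errors" exists; adjust the query to your dashboards)
curl -s -G "https://api.datadoghq.com/api/v1/query" \
  --data-urlencode "query=sum:bonsapi.errors{*}.as_count()" \
  --data-urlencode "from=$(( $(date +%s) - 3600 ))" \
  --data-urlencode "to=$(date +%s)" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}"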

Step 3: Determine Severity #

Use the severity levels table above to classify the incident.

Severity Decision Tree:

Is production completely down? → SEV1
Is a critical user flow broken? → SEV2
Is a single feature affected with workaround? → SEV3
Is it a minor cosmetic issue? → SEV4

Phase 2: Response & Mitigation (5-30 minutes) #

Goal: Stop the bleeding, restore service

Step 1: Communicate #

For SEV1/SEV2 incidents:

  1. Acknowledge in PagerDuty

    • Open PagerDuty incident
    • Click “Acknowledge” to stop alerts
    • Set yourself as incident commander
  2. Create a dedicated incident channel (announce it in #eng-incident-prd)

    #incident-YYYYMMDD-brief-description
    Example: #incident-20251021-api-down
    
  3. Post initial status

    🚨 INCIDENT DETECTED
    Severity: 2
    Issue: API returning 502 errors
    Impact: ~50% of API requests failing
    Started: 10:15 UTC
    Status: Investigating
    Incident Commander: @your-name
    PagerDuty: https://tofu-bonsai.pagerduty.com/incidents/[ID]
    
  4. Notify stakeholders

    • Engineering team (auto-notified via PagerDuty)
    • Customer success (if customer-facing)
    • Management (for SEV1)

For SEV3/SEV4 incidents:

  • Post in #eng-incident-prd
  • Create ticket in Linear
  • No need for dedicated incident channel

Step 2: Assign Roles #

For major incidents (SEV1/SEV2):

  • Incident Commander (IC) - Coordinates response, makes decisions
  • Investigator(s) - Debug and identify root cause
  • Communicator - Update stakeholders, customers
  • Scribe - Document actions taken

Step 3: Investigate #

Use the appropriate runbook:

| Symptom | Runbook |
|---------|---------|
| API errors/downtime | Monitoring & Alerting |
| Pods crashing | Kubernetes Debugging |
| Database issues | Database Access |
| Queue backlog | RabbitMQ Management |
| Deployment issues | Deployment Monitoring |

Investigation checklist:

# 1. Check service status
kubectl get pods -o wide
kubectl get services
kubectl get ingress

# 2. Check pod logs
kubectl logs -l app=bonsapi --tail=100
kubectl logs -l app=bonsapi --tail=100 | grep -i error

# 3. Check events
kubectl get events --sort-by='.lastTimestamp' | head -30

# 4. Check resource usage
kubectl top nodes
kubectl top pods

# 5. Check external dependencies
# - Database connectivity
# - Redis connectivity
# - RabbitMQ status
# - Third-party APIs (Clerk, Stripe, etc.)
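# Example connectivity probes from inside a pod (pod and host names are
# placeholders; ports are the PostgreSQL/Redis/RabbitMQ defaults):
kubectl exec -it <pod-name> -- nc -zv <db-host> 5432
kubectl exec -it <pod-name> -- nc -zv <redis-host> 6379
kubectl exec -it <pod-name> -- nc -zv <rabbitmq-host> 5672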

# 6. Review recent changes
git log --oneline --since="2 hours ago"
# Check GitHub Actions for recent deployments
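# Optional: list recent GitHub Actions runs with the gh CLI (assumes gh is
# installed and authenticated; filter by workflow as needed):
gh run list --limit 10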

Step 4: Implement Mitigation #

Mitigation strategies (in order of preference):

  1. Quick fix (if root cause is known and simple)

    # Example: Fix config issue
    kubectl edit configmap app-config
    kubectl rollout restart deployment/bonsapi-deployment
    
  2. Rollback (if caused by recent deployment)

    kubectl rollout undo deployment/bonsapi-deployment
    kubectl rollout status deployment/bonsapi-deployment
    
  3. Scale up (if resource exhaustion)

    kubectl scale deployment/bonsapi-deployment --replicas=5
    
  4. Disable feature (if feature causing issues)

    # Use feature flag to disable problematic feature
    # Update in Doppler or config
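    # Minimal sketch (hypothetical flag name and Doppler project/config names):
    doppler secrets set ENABLE_DOC_PROCESSING=false --project bonsapi --config prd
    kubectl rollout restart deployment/bonsapi-deployment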
    
  5. Failover (if infrastructure issue)

    • Route traffic to backup region (if available)
    • Switch to backup database
    • Use cached data

Step 5: Verify Mitigation #

After implementing mitigation:

# 1. Check service health
kubectl get pods
kubectl logs <pod-name> --tail=50

# 2. Test affected functionality
curl https://api.gotofu.com/health
# Test specific endpoints that were broken
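# Optionally watch the health endpoint for a few minutes to confirm stability
# (prints only the HTTP status code every 10 seconds):
watch -n 10 'curl -s -o /dev/null -w "%{http_code}\n" https://api.gotofu.com/health'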

# 3. Monitor metrics
# - Error rate should drop to normal
# - Response time should improve
# - Request volume should recover

Update stakeholders:

✅ MITIGATION APPLIED
Action: Rolled back to previous version
Status: Service restored
Impact: API errors stopped, service operational
Time to mitigate: 18 minutes
Next steps: Root cause analysis

Update PagerDuty:

  1. Add note to PagerDuty incident with mitigation details
  2. Keep incident open until fully resolved
  3. Update incident status to “Acknowledged” (not “Resolved” yet)

Phase 3: Resolution & Recovery (30+ minutes) #

Goal: Fully resolve the issue and restore normal operations

Step 1: Confirm Full Recovery #

  • All metrics back to baseline
  • Error rates normal
  • No customer complaints
  • All dependent systems operational

Step 2: Identify Root Cause #

5 Whys Analysis:

Example:
Problem: API returned 502 errors

Why? → BonsAPI pods were not responding
Why? → Pods were out of memory (OOMKilled)
Why? → Memory leak in new code
Why? → Missing memory cleanup in document processing
Why? → Code review didn't catch the leak

Root Cause: Insufficient testing and code review for memory management

Step 3: Implement Permanent Fix #

  1. Create fix branch

    git checkout -b fix/incident-api-oom
    
  2. Implement fix

    • Fix the root cause
    • Add tests to prevent regression
    • Update documentation
  3. Test thoroughly

    mise run test
    mise run ci
    
  4. Deploy fix

    • Follow standard deployment process
    • Monitor closely during rollout

Step 4: Communicate Resolution #

🎉 INCIDENT RESOLVED
Issue: Memory leak in document processing
Root Cause: Missing cleanup in new feature
Fix: Deployed patch v2.1.3
Status: Fully resolved
Duration: 45 minutes
Customer impact: 50% of API requests affected
Follow-up: Post-incident review scheduled

Close PagerDuty Incident:

  1. Add final resolution note in PagerDuty
  2. Mark incident as “Resolved”
  3. Link to post-incident review document
  4. Ensure incident timeline is accurate in PagerDuty

Phase 4: Post-Incident Review (Within 48 hours) #

Goal: Learn from the incident and prevent recurrence

Step 1: Document Timeline #

Create incident report with:

  • Incident summary

    • What happened?
    • When did it happen?
    • Who was affected?
    • How was it detected?
  • Timeline of events

    10:15 UTC - Alert fired: High error rate
    10:17 UTC - Investigation started
    10:22 UTC - Root cause identified: OOM in pods
    10:25 UTC - Mitigation: Rollback initiated
    10:33 UTC - Service restored
    10:45 UTC - Permanent fix deployed
    11:00 UTC - Incident closed
    
  • Impact assessment

    • Users affected: ~1,000 active users
    • Duration: 45 minutes
    • Revenue impact: Estimated $X
    • API requests failed: ~50,000
  • Root cause analysis

    • Technical cause
    • Contributing factors
    • Why it wasn’t caught earlier

Step 2: Identify Action Items #

Categories:

  1. Immediate fixes (done during incident)
  2. Short-term improvements (within 1 week)
  3. Long-term improvements (within 1 month)

Example action items:

✅ COMPLETED
- Rolled back deployment
- Deployed memory leak fix
- Added monitoring for memory usage

🔄 IN PROGRESS (Week 1)
- Add memory profiling to CI pipeline
- Update code review checklist for memory management
- Improve alerting thresholds

📋 PLANNED (Month 1)
- Implement automatic rollback on OOM
- Add chaos engineering tests
- Improve runbook documentation

Step 3: Conduct Blameless Post-Mortem #

Meeting agenda:

  1. Review incident timeline (10 min)
  2. Discuss what went well (10 min)
    • Fast detection
    • Quick rollback
    • Good communication
  3. Discuss what could improve (20 min)
    • Earlier detection
    • Better testing
    • Improved monitoring
  4. Review action items (10 min)
  5. Assign owners and deadlines (10 min)

Key principles:

  • Blameless - Focus on systems, not individuals
  • Learning-focused - What can we learn?
  • Action-oriented - Concrete improvements
  • Documented - Share learnings with team

Common Incident Scenarios #

Scenario 1: Complete Service Outage #

Symptoms:

  • All API requests failing
  • Health checks failing
  • Pods not responding

Quick Actions:

# 1. Check pod status
kubectl get pods -o wide

# 2. Check recent deployments
kubectl rollout history deployment/bonsapi-deployment

# 3. Rollback immediately
kubectl rollout undo deployment/bonsapi-deployment

# 4. Monitor recovery
kubectl rollout status deployment/bonsapi-deployment

Scenario 2: Database Connection Issues #

Symptoms:

  • Database connection errors in logs
  • Timeouts on database queries
  • 500 errors from API

Quick Actions:

# 1. Check database status
aws rds describe-db-instances --db-instance-identifier bonsai-prod-db

# 2. Test connectivity from pod
kubectl exec -it <pod-name> -- nc -zv <db-host> 5432

# 3. Check connection pool
kubectl logs -l app=bonsapi | grep "connection pool"

# 4. Scale down if overwhelming DB
kubectl scale deployment/bonsapi-deployment --replicas=2
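
# 5. Check recent RDS events and active DB connections (assumes psql is
#    available in the pod and DATABASE_URL is set; names are examples)
aws rds describe-events --source-identifier bonsai-prod-db --source-type db-instance --duration 60
kubectl exec -it <pod-name> -- psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"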

Scenario 3: Third-Party Service Failure #

Symptoms:

  • Errors calling external API (Clerk, Stripe, etc.)
  • Timeouts on authentication/payments
  • Partial functionality broken

Quick Actions:

  1. Verify third-party status (see the quick checks after this list)

    • Check status page (status.clerk.com, status.stripe.com)
    • Test API directly
  2. Enable graceful degradation

    • Use cached data if available
    • Queue requests for retry
    • Show user-friendly error message
  3. Communicate with users

    • Status page update
    • In-app notification
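
A minimal sketch for step 1 (assumes curl is available in the pod; the pod name is a placeholder):

# Confirm the vendor status pages respond
curl -sI https://status.clerk.com | head -1
curl -sI https://status.stripe.com | head -1

# Test a third-party API directly from inside a pod
kubectl exec -it <pod-name> -- curl -sI https://api.stripe.com | head -1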

Scenario 4: Security Incident #

Symptoms:

  • Suspicious activity in logs
  • Unexpected data access
  • Security alert from monitoring

CRITICAL ACTIONS:

  1. Contain the threat IMMEDIATELY

    # Isolate affected systems
    kubectl scale deployment/<affected-deployment> --replicas=0
    
    # Rotate potentially compromised credentials
    # See Secrets Management runbook
    
  2. Preserve evidence (see the capture sketch after this list)

    • Capture logs
    • Take snapshots
    • Document findings
  3. Escalate to security team

    • DO NOT discuss in public channels
    • Use secure communication
  4. Follow security incident protocol

    • Assess scope of breach
    • Notify affected parties (if required)
    • Comply with legal/regulatory requirements
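
A minimal evidence-capture sketch for step 2 (output file names and the snapshot identifier are examples; the RDS instance name matches the one used above):

# Capture pod logs and cluster events before anything is restarted
kubectl logs -l app=bonsapi --all-containers --tail=-1 > incident-bonsapi-logs.txt
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > incident-events.txt

# Snapshot the production database for forensics
aws rds create-db-snapshot \
  --db-instance-identifier bonsai-prod-db \
  --db-snapshot-identifier incident-forensics-$(date +%Y%m%d%H%M)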

Incident Communication #

Internal Communication #

Status updates every 15-30 minutes:

⏱️ STATUS UPDATE (10:30 UTC)
Current status: Still investigating
Progress: Identified high memory usage in API pods
Next steps: Implementing memory limit increase as temporary fix
ETA: 10 minutes

Customer Communication #

For customer-facing incidents:

  1. Initial acknowledgment (< 15 minutes)

    We're aware of an issue affecting API response times.
    Our team is investigating and will provide updates soon.
    
  2. Status updates (every 30-60 minutes)

    Update: We've identified the cause and are implementing a fix.
    Impact: ~50% of API requests experiencing delays.
    ETA: Service restoration expected within 30 minutes.
    
  3. Resolution notice

    Resolved: The API issue has been fixed and service is fully restored.
    We apologize for the inconvenience and are taking steps to prevent recurrence.
    

Incident Tools & Resources #

Useful Commands #

# Quick health check script
kubectl get pods -o wide && \
kubectl get deployments && \
kubectl get services && \
kubectl top nodes && \
kubectl top pods

# Tail logs from all services
kubectl logs -f -l app=bonsapi &
kubectl logs -f -l app=webapp &
kubectl logs -f -l app=bonsai-invoice &

# Watch events
watch kubectl get events --sort-by='.lastTimestamp'

Escalation Path #

When to Escalate #

  • Incident severity is SEV1
  • Mitigation attempts have failed
  • Issue requires specialized expertise
  • Legal or compliance concerns
  • Security incident

Escalation Contacts #

  1. Team Lead / Engineering Manager

    • For technical decisions
    • Resource allocation
    • Business impact assessment
  2. DevOps / Infrastructure Team

    • Infrastructure issues
    • AWS/Kubernetes expertise
    • Database administration
  3. Security Team

    • Security incidents
    • Data breaches
    • Compliance issues
  4. Customer Success / Support

    • Customer communication
    • Impact assessment
    • Workarounds

Best Practices #

During Incidents #

  • Stay calm - Clear thinking is crucial
  • Communicate clearly - Avoid jargon, be specific
  • Document everything - Actions, findings, decisions
  • Focus on mitigation first - Root cause analysis comes later
  • Don’t guess - Verify before implementing changes
  • Ask for help - Escalate early if stuck

Incident Prevention #

  • Monitor proactively - Don’t wait for customers to report
  • Test thoroughly - Catch issues before production
  • Deploy carefully - Use canary/blue-green deployments
  • Review regularly - Learn from past incidents
  • Practice - Run incident response drills

Post-Incident #

  • Close the loop - Complete all action items
  • Share learnings - Educate the team
  • Update runbooks - Improve documentation
  • Recognize effort - Thank the responders

See Also #