Incident Response #

This runbook provides a structured approach to handling production incidents, from initial detection through resolution and post-incident review.

When to Use This Runbook #

  • Production service outage or degradation
  • Security incidents or suspected breaches
  • Data integrity issues
  • Customer-impacting bugs
  • System performance degradation
  • Third-party service failures affecting BonsAI
  • PagerDuty alert received

Incident Severity Levels #

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV1 | Critical - Complete service outage | Immediate | API down, database unavailable, security breach |
| SEV2 | High - Major functionality impaired | 15 minutes | Slow response times, login failures, data processing stopped |
| SEV3 | Medium - Partial functionality affected | 1 hour | Single feature broken, non-critical errors |
| SEV4 | Low - Minor issues, workaround available | Next business day | UI glitches, non-urgent bugs |

Incident Response Phases #

Phase 1: Detection & Triage (0-5 minutes) #

Goal: Identify and assess the incident quickly

Step 1: Alert Received #

Incidents may be detected through:

  • PagerDuty - Primary incident alerting system
  • Monitoring alerts (Datadog, CloudWatch)
  • Customer reports (support tickets, Slack)
  • Error tracking (Sentry)
  • Team member observation

PagerDuty Integration:

When PagerDuty triggers an alert:

  1. You’ll receive notification via:
    • Phone call (for SEV1)
    • SMS
    • Push notification
    • Email
  2. Acknowledge the incident in PagerDuty immediately
  3. Create incident thread in Slack #eng-incident-prd
  4. Link the PagerDuty incident to the Slack thread

Step 2: Initial Assessment #

Quick questions to answer:
1. What is broken? (service, feature, user flow)
2. How many users are affected? (all, some, specific segment)
3. Is this a new issue or recurring?
4. When did it start? (use monitoring dashboards)
5. What changed recently? (deployments, config changes)

Gather initial evidence:

# Check service health
kubectl get pods
kubectl get pods --all-namespaces | grep -v Running

# Check recent deployments
kubectl rollout history deployment/bonsapi-deployment

# Check recent events
kubectl get events --sort-by='.lastTimestamp' | head -20

# Check error rates in Datadog
# Check Sentry for exception spikes
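# Optional: pull an error-rate series from Datadog's metrics query API
# (assumes DD_API_KEY / DD_APP_KEY are exported and that a metric like the
# hypothetical "bonsapi.errors" exists; adjust the query to your dashboards)
curl -s -G "https://api.datadoghq.com/api/v1/query" \
  --data-urlencode "query=sum:bonsapi.errors{*}.as_count()" \
  --data-urlencode "from=$(( $(date +%s) - 3600 ))" \
  --data-urlencode "to=$(date +%s)" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}"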

Step 3: Determine Severity #

Use the severity levels table above to classify the incident.

Severity Decision Tree:

Is production completely down? → SEV1
Is a critical user flow broken? → SEV2
Is a single feature affected with workaround? → SEV3
Is it a minor cosmetic issue? → SEV4

Phase 2: Response & Mitigation (5-30 minutes) #

Goal: Stop the bleeding, restore service

Step 1: Communicate #

For SEV1/SEV2 incidents:

  1. Acknowledge in PagerDuty

    • Open PagerDuty incident
    • Click “Acknowledge” to stop alerts
    • Set yourself as incident commander
  2. Create a dedicated incident channel (announce it in #eng-incident-prd)

    #incident-YYYYMMDD-brief-description
    Example: #incident-20251021-api-down
    
  3. Post initial status

    🚨 INCIDENT DETECTED
    Severity: 2
    Issue: API returning 502 errors
    Impact: ~50% of API requests failing
    Started: 10:15 UTC
    Status: Investigating
    Incident Commander: @your-name
    PagerDuty: https://tofu-bonsai.pagerduty.com/incidents/[ID]
    
  4. Notify stakeholders

    • Engineering team (auto-notified via PagerDuty)
    • Customer success (if customer-facing)
    • Management (for SEV1)

For SEV3/SEV4 incidents:

  • Post in #eng-incident-prd
  • Create ticket in Linear
  • No need for dedicated incident channel

Step 2: Assign Roles #

For major incidents (SEV1/SEV2):

  • Incident Commander (IC) - Coordinates response, makes decisions
  • Investigator(s) - Debug and identify root cause
  • Communicator - Update stakeholders, customers
  • Scribe - Document actions taken

Step 3: Investigate #

Use the appropriate runbook:

| Symptom | Runbook |
|---------|---------|
| API errors/downtime | Monitoring & Alerting |
| Pods crashing | Kubernetes Debugging |
| Database issues | Database Access |
| Queue backlog | RabbitMQ Management |
| Deployment issues | Deployment Monitoring |

Investigation checklist:

# 1. Check service status
kubectl get pods -o wide
kubectl get services
kubectl get ingress

# 2. Check pod logs
kubectl logs -l app=bonsapi --tail=100
kubectl logs -l app=bonsapi --tail=100 | grep -i error

# 3. Check events
kubectl get events --sort-by='.lastTimestamp' | head -30

# 4. Check resource usage
kubectl top nodes
kubectl top pods

# 5. Check external dependencies
# - Database connectivity
# - Redis connectivity
# - RabbitMQ status
# - Third-party APIs (Clerk, Stripe, etc.)
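# Example connectivity probes from inside a pod (pod and host names are
# placeholders; ports are the PostgreSQL/Redis/RabbitMQ defaults):
kubectl exec -it <pod-name> -- nc -zv <db-host> 5432
kubectl exec -it <pod-name> -- nc -zv <redis-host> 6379
kubectl exec -it <pod-name> -- nc -zv <rabbitmq-host> 5672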

# 6. Review recent changes
git log --oneline --since="2 hours ago"
# Check GitHub Actions for recent deployments
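# Optional: list recent GitHub Actions runs with the gh CLI (assumes gh is
# installed and authenticated; filter by workflow as needed):
gh run list --limit 10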

Step 4: Implement Mitigation #

Mitigation strategies (in order of preference):

  1. Quick fix (if root cause is known and simple)

    # Example: Fix config issue
    kubectl edit configmap app-config
    kubectl rollout restart deployment/bonsapi-deployment
    
  2. Rollback (if caused by recent deployment)

    kubectl rollout undo deployment/bonsapi-deployment
    kubectl rollout status deployment/bonsapi-deployment
    
  3. Scale up (if resource exhaustion)

    kubectl scale deployment/bonsapi-deployment --replicas=5
    
  4. Disable feature (if feature causing issues)

    # Use feature flag to disable problematic feature
    # Update in Doppler or config
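    # Minimal sketch (hypothetical flag name and Doppler project/config names):
    doppler secrets set ENABLE_DOC_PROCESSING=false --project bonsapi --config prd
    kubectl rollout restart deployment/bonsapi-deployment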
    
  5. Failover (if infrastructure issue)

    • Route traffic to backup region (if available)
    • Switch to backup database
    • Use cached data

Step 5: Verify Mitigation #

After implementing mitigation:

# 1. Check service health
kubectl get pods
kubectl logs <pod-name> --tail=50

# 2. Test affected functionality
curl https://api.gotofu.com/health
# Test specific endpoints that were broken
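# Optionally watch the health endpoint for a few minutes to confirm stability
# (prints only the HTTP status code every 10 seconds):
watch -n 10 'curl -s -o /dev/null -w "%{http_code}\n" https://api.gotofu.com/health'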

# 3. Monitor metrics
# - Error rate should drop to normal
# - Response time should improve
# - Request volume should recover

Update stakeholders:

✅ MITIGATION APPLIED
Action: Rolled back to previous version
Status: Service restored
Impact: API errors stopped, service operational
Time to mitigate: 18 minutes
Next steps: Root cause analysis

Update PagerDuty:

  1. Add note to PagerDuty incident with mitigation details
  2. Keep incident open until fully resolved
  3. Update incident status to “Acknowledged” (not “Resolved” yet)

Phase 3: Resolution & Recovery (30+ minutes) #

Goal: Fully resolve the issue and restore normal operations

Step 1: Confirm Full Recovery #

  • All metrics back to baseline
  • Error rates normal
  • No customer complaints
  • All dependent systems operational

Step 2: Identify Root Cause #

5 Whys Analysis:

Example:
Problem: API returned 502 errors

Why? → BonsAPI pods were not responding
Why? → Pods were out of memory (OOMKilled)
Why? → Memory leak in new code
Why? → Missing memory cleanup in document processing
Why? → Code review didn't catch the leak

Root Cause: Insufficient testing and code review for memory management

Step 3: Implement Permanent Fix #

  1. Create fix branch

    git checkout -b fix/incident-api-oom
    
  2. Implement fix

    • Fix the root cause
    • Add tests to prevent regression
    • Update documentation
  3. Test thoroughly

    mise run test
    mise run ci
    
  4. Deploy fix

    • Follow standard deployment process
    • Monitor closely during rollout

Step 4: Communicate Resolution #

🎉 INCIDENT RESOLVED
Issue: Memory leak in document processing
Root Cause: Missing cleanup in new feature
Fix: Deployed patch v2.1.3
Status: Fully resolved
Duration: 45 minutes
Customer impact: 50% of API requests affected
Follow-up: Post-incident review scheduled

Close PagerDuty Incident:

  1. Add final resolution note in PagerDuty
  2. Mark incident as “Resolved”
  3. Link to post-incident review document
  4. Ensure incident timeline is accurate in PagerDuty

Phase 4: Post-Incident Review (Within 48 hours) #

Goal: Learn from the incident and prevent recurrence

Step 1: Document Timeline #

Create incident report with:

  • Incident summary

    • What happened?
    • When did it happen?
    • Who was affected?
    • How was it detected?
  • Timeline of events

    10:15 UTC - Alert fired: High error rate
    10:17 UTC - Investigation started
    10:22 UTC - Root cause identified: OOM in pods
    10:25 UTC - Mitigation: Rollback initiated
    10:33 UTC - Service restored
    10:45 UTC - Permanent fix deployed
    11:00 UTC - Incident closed
    
  • Impact assessment

    • Users affected: ~1,000 active users
    • Duration: 45 minutes
    • Revenue impact: Estimated $X
    • API requests failed: ~50,000
  • Root cause analysis

    • Technical cause
    • Contributing factors
    • Why it wasn’t caught earlier

Step 2: Identify Action Items #

Categories:

  1. Immediate fixes (done during incident)
  2. Short-term improvements (within 1 week)
  3. Long-term improvements (within 1 month)

Example action items:

✅ COMPLETED
- Rolled back deployment
- Deployed memory leak fix
- Added monitoring for memory usage

🔄 IN PROGRESS (Week 1)
- Add memory profiling to CI pipeline
- Update code review checklist for memory management
- Improve alerting thresholds

📋 PLANNED (Month 1)
- Implement automatic rollback on OOM
- Add chaos engineering tests
- Improve runbook documentation

Step 3: Conduct Blameless Post-Mortem #

Meeting agenda:

  1. Review incident timeline (10 min)
  2. Discuss what went well (10 min)
    • Fast detection
    • Quick rollback
    • Good communication
  3. Discuss what could improve (20 min)
    • Earlier detection
    • Better testing
    • Improved monitoring
  4. Review action items (10 min)
  5. Assign owners and deadlines (10 min)

Key principles:

  • Blameless - Focus on systems, not individuals
  • Learning-focused - What can we learn?
  • Action-oriented - Concrete improvements
  • Documented - Share learnings with team

Common Incident Scenarios #

Scenario 1: Complete Service Outage #

Symptoms:

  • All API requests failing
  • Health checks failing
  • Pods not responding

Quick Actions:

# 1. Check pod status
kubectl get pods -o wide

# 2. Check recent deployments
kubectl rollout history deployment/bonsapi-deployment

# 3. Rollback immediately
kubectl rollout undo deployment/bonsapi-deployment

# 4. Monitor recovery
kubectl rollout status deployment/bonsapi-deployment

Scenario 2: Database Connection Issues #

Symptoms:

  • Database connection errors in logs
  • Timeouts on database queries
  • 500 errors from API

Quick Actions:

# 1. Check database status
aws rds describe-db-instances --db-instance-identifier bonsai-prod-db

# 2. Test connectivity from pod
kubectl exec -it <pod-name> -- nc -zv <db-host> 5432

# 3. Check connection pool
kubectl logs -l app=bonsapi | grep "connection pool"

# 4. Scale down if overwhelming DB
kubectl scale deployment/bonsapi-deployment --replicas=2
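
# 5. Check recent RDS events and active DB connections (assumes psql is
#    available in the pod and DATABASE_URL is set; names are examples)
aws rds describe-events --source-identifier bonsai-prod-db --source-type db-instance --duration 60
kubectl exec -it <pod-name> -- psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity;"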

Scenario 3: Third-Party Service Failure #

Symptoms:

  • Errors calling external API (Clerk, Stripe, etc.)
  • Timeouts on authentication/payments
  • Partial functionality broken

Quick Actions:

  1. Verify third-party status (see the quick checks after this list)

    • Check status page (status.clerk.com, status.stripe.com)
    • Test API directly
  2. Enable graceful degradation

    • Use cached data if available
    • Queue requests for retry
    • Show user-friendly error message
  3. Communicate with users

    • Status page update
    • In-app notification
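
A minimal sketch for step 1 (assumes curl is available in the pod; the pod name is a placeholder):

# Confirm the vendor status pages respond
curl -sI https://status.clerk.com | head -1
curl -sI https://status.stripe.com | head -1

# Test a third-party API directly from inside a pod
kubectl exec -it <pod-name> -- curl -sI https://api.stripe.com | head -1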

Scenario 4: Security Incident #

Symptoms:

  • Suspicious activity in logs
  • Unexpected data access
  • Security alert from monitoring

CRITICAL ACTIONS:

  1. Contain the threat IMMEDIATELY

    # Isolate affected systems
    kubectl scale deployment/<affected-deployment> --replicas=0
    
    # Rotate potentially compromised credentials
    # See Secrets Management runbook
    
  2. Preserve evidence (see the capture sketch after this list)

    • Capture logs
    • Take snapshots
    • Document findings
  3. Escalate to security team

    • DO NOT discuss in public channels
    • Use secure communication
  4. Follow security incident protocol

    • Assess scope of breach
    • Notify affected parties (if required)
    • Comply with legal/regulatory requirements
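
A minimal evidence-capture sketch for step 2 (output file names and the snapshot identifier are examples; the RDS instance name matches the one used above):

# Capture pod logs and cluster events before anything is restarted
kubectl logs -l app=bonsapi --all-containers --tail=-1 > incident-bonsapi-logs.txt
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > incident-events.txt

# Snapshot the production database for forensics
aws rds create-db-snapshot \
  --db-instance-identifier bonsai-prod-db \
  --db-snapshot-identifier incident-forensics-$(date +%Y%m%d%H%M)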

Incident Communication #

Internal Communication #

Status updates every 15-30 minutes:

⏱️ STATUS UPDATE (10:30 UTC)
Current status: Still investigating
Progress: Identified high memory usage in API pods
Next steps: Implementing memory limit increase as temporary fix
ETA: 10 minutes

Customer Communication #

For customer-facing incidents:

  1. Initial acknowledgment (< 15 minutes)

    We're aware of an issue affecting API response times.
    Our team is investigating and will provide updates soon.
    
  2. Status updates (every 30-60 minutes)

    Update: We've identified the cause and are implementing a fix.
    Impact: ~50% of API requests experiencing delays.
    ETA: Service restoration expected within 30 minutes.
    
  3. Resolution notice

    Resolved: The API issue has been fixed and service is fully restored.
    We apologize for the inconvenience and are taking steps to prevent recurrence.
    

Incident Tools & Resources #

Useful Commands #

# Quick health check script
kubectl get pods -o wide && \
kubectl get deployments && \
kubectl get services && \
kubectl top nodes && \
kubectl top pods

# Tail logs from all services
kubectl logs -f -l app=bonsapi &
kubectl logs -f -l app=webapp &
kubectl logs -f -l app=bonsai-invoice &

# Watch events
watch kubectl get events --sort-by='.lastTimestamp'

Escalation Path #

When to Escalate #

  • Incident severity is SEV1
  • Mitigation attempts have failed
  • Issue requires specialized expertise
  • Legal or compliance concerns
  • Security incident

Escalation Contacts #

  1. Team Lead / Engineering Manager

    • For technical decisions
    • Resource allocation
    • Business impact assessment
  2. DevOps / Infrastructure Team

    • Infrastructure issues
    • AWS/Kubernetes expertise
    • Database administration
  3. Security Team

    • Security incidents
    • Data breaches
    • Compliance issues
  4. Customer Success / Support

    • Customer communication
    • Impact assessment
    • Workarounds

Best Practices #

During Incidents #

  • Stay calm - Clear thinking is crucial
  • Communicate clearly - Avoid jargon, be specific
  • Document everything - Actions, findings, decisions
  • Focus on mitigation first - Root cause analysis comes later
  • Don’t guess - Verify before implementing changes
  • Ask for help - Escalate early if stuck

Incident Prevention #

  • Monitor proactively - Don’t wait for customers to report
  • Test thoroughly - Catch issues before production
  • Deploy carefully - Use canary/blue-green deployments
  • Review regularly - Learn from past incidents
  • Practice - Run incident response drills

Post-Incident #

  • Close the loop - Complete all action items
  • Share learnings - Educate the team
  • Update runbooks - Improve documentation
  • Recognize effort - Thank the responders

See Also #