Operations Runbooks #

Comprehensive operational guides for troubleshooting and managing the BonsAI platform in production. These runbooks are designed for both first-time operators and experienced team members responding to production issues.

When to Use These Runbooks #

  • Production incidents - When alerts fire or users report issues
  • Routine operations - Regular maintenance and monitoring tasks
  • Troubleshooting - Investigating performance or functionality problems
  • Emergency procedures - Critical issues requiring immediate action
  • Knowledge transfer - Onboarding new team members to operations

Available Runbooks #

Monitoring & Observability #

Infrastructure Access #

Service Management #

Deployment & CI/CD #

Emergency Response #

General Principles #

Safety First #

  • Always work in a non-production environment first when possible
  • Use read-only queries before making changes
  • Take backups before destructive operations
  • Document all production changes
  • Get peer review for risky operations
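The "read-only queries first" rule can be backed by a small guard in operator tooling. The following is a minimal sketch, assuming SQL is submitted as plain text; the function name `is_read_only` and the keyword allowlist are hypothetical and not part of the BonsAI platform. It is a coarse check (for example, it does not understand semicolons inside string literals or `SELECT ... INTO`), so treat it as a supplement to a read-only database role, not a replacement.

```python
import re

# Keywords treated as read-only in this sketch (an assumption, not platform policy).
READ_ONLY_START = re.compile(r"(select|show)\b", re.IGNORECASE)

# Matches "-- line comments" and "/* block comments */".
SQL_COMMENT = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)


def is_read_only(sql: str) -> bool:
    """Return True only if every statement in `sql` starts with a read-only keyword."""
    without_comments = SQL_COMMENT.sub(" ", sql)
    statements = [s.strip() for s in without_comments.split(";") if s.strip()]
    # An empty input is rejected rather than treated as safe.
    return bool(statements) and all(READ_ONLY_START.match(s) for s in statements)
```

A wrapper script could call `is_read_only(query)` and require explicit confirmation, a backup reference, and a reviewer name before running anything that returns `False`.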

Access Control #

Production access requires:

  • PagerDuty account for on-call alerts
  • AWS SSO configuration for cloud resources
  • Tailscale VPN for database access (when configured)
  • Doppler access for secrets management
  • Appropriate AWS roles and permissions
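Before an on-call shift, it can help to confirm that the command-line tools behind these requirements are installed. The sketch below assumes the standard binary names `aws`, `tailscale`, and `doppler`; the mapping and the function name `missing_tools` are illustrative. It only checks that each binary is on `PATH`, not that you are logged in or hold the right roles.

```python
import shutil

# Requirement -> CLI binary name. These names are assumptions; adjust to your setup.
REQUIRED_CLIS = {
    "AWS SSO": "aws",
    "Tailscale VPN": "tailscale",
    "Doppler": "doppler",
}


def missing_tools(required=REQUIRED_CLIS, which=shutil.which):
    """Return the requirement names whose CLI binary is not found on PATH."""
    return [name for name, binary in required.items() if which(binary) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools for:", ", ".join(missing))
    else:
        print("All required CLIs found on PATH.")
```

The `which` parameter exists so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.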

Communication #

During incidents:

  • Acknowledge PagerDuty alert immediately
  • Create thread in #eng-incident-prd Slack channel
  • Document all steps in incident thread
  • Update PagerDuty incident with progress notes
  • Use Datadog/CloudWatch dashboards to share context
  • Close PagerDuty incident when resolved
  • Schedule post-incident review within 48 hours
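Two of the steps above lend themselves to small helpers: timestamping notes for the incident thread and working out the latest acceptable time for the post-incident review. This is a minimal sketch that assumes the 48-hour window is counted from the moment the incident is resolved (the text does not say which event starts the clock) and that timestamps are timezone-aware; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Review window from the checklist; counting from resolution time is an assumption.
REVIEW_WINDOW = timedelta(hours=48)


def review_deadline(resolved_at: datetime) -> datetime:
    """Return the latest time the post-incident review should be scheduled for."""
    return resolved_at + REVIEW_WINDOW


def format_note(at: datetime, text: str) -> str:
    """Format a progress note with a UTC timestamp for the incident thread."""
    return f"[{at.astimezone(timezone.utc):%Y-%m-%d %H:%M} UTC] {text}"
```

For example, `format_note(datetime.now(timezone.utc), "Restarted the service")` produces a line that can be pasted into both the Slack thread and the PagerDuty incident, so the two records stay consistent.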

Need Help? #

If you’re dealing with a critical incident and these runbooks aren’t sufficient:

  1. Escalate immediately - Contact Ken Kanai
  2. Check status pages - Verify third-party service status (AWS, Datadog, Clerk, etc.)
  3. Gather context - Collect logs, metrics, and error messages to share with the person you escalate to
  4. Document everything - Keep notes of what you’ve tried and observed

Remember: It’s always better to ask for help than to make the situation worse. When in doubt, escalate early.