Operations Runbooks #

These runbooks are operational guides for troubleshooting and managing the BonsAI platform in production. They are written for both first-time operators and experienced team members responding to production issues.

When to Use These Runbooks #

Production incidents - When alerts fire or users report issues
Routine operations - Regular maintenance and monitoring tasks
Troubleshooting - Investigating performance or functionality problems
Emergency procedures - Critical issues requiring immediate action
Knowledge transfer - Onboarding new team members to operations

Available Runbooks #

Monitoring & Observability #

Monitoring & Alerting - PagerDuty, Datadog, Sentry, CloudWatch setup and alert interpretation
Log Aggregation & Search - Finding and analyzing logs with trace IDs across all services
LLM Observability - Monitoring AI/ML operations, token usage, and LLM performance

Infrastructure Access #

Kubernetes Debugging - kubectl commands, pod inspection, and cluster troubleshooting
Database Access - Connecting to production databases safely via Tailscale and CloudBeaver
Secrets Management - Using Doppler to view and update environment variables

Service Management #

RabbitMQ Management - Queue monitoring, message inspection, and queue management
Service Health Checks - Verifying system health across all components

Deployment & CI/CD #

Deployment Monitoring - Tracking deployments via GitHub Actions and Kubernetes

Emergency Response #

Incident Response - Step-by-step incident management with PagerDuty integration

General Principles #

Safety First #

Work in a non-production environment first whenever possible
Use read-only queries before making changes
Take backups before destructive operations
Document all production changes
Get peer review for risky operations
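
As an illustration of the "read-only queries before making changes" principle, the sketch below counts the rows a later destructive statement would touch, using a connection that cannot write. It uses Python's built-in `sqlite3` with a throwaway database purely as a stand-in; the `jobs` table, its `status` column, and the helper names are hypothetical and are not part of the BonsAI schema or tooling.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: str) -> sqlite3.Connection:
    # mode=ro makes SQLite reject every write attempted on this connection.
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def preview_delete(db_path: str, status: str) -> int:
    """Count the rows a later DELETE would remove, without changing anything."""
    conn = open_read_only(db_path)
    try:
        row = conn.execute(
            "SELECT COUNT(*) FROM jobs WHERE status = ?", (status,)
        ).fetchone()
        return row[0]
    finally:
        conn.close()


def make_demo_db() -> str:
    """Build a throwaway database so the sketch runs anywhere (hypothetical schema)."""
    path = str(Path(tempfile.mkdtemp()) / "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO jobs (status) VALUES (?)",
        [("failed",), ("failed",), ("done",)],
    )
    conn.commit()
    conn.close()
    return path


if __name__ == "__main__":
    db = make_demo_db()
    print(preview_delete(db, "failed"))
```

The same pattern (preview the affected rows over a connection that cannot write, then have a peer confirm the count before running the change) carries over to other databases, though the way to request a read-only session differs per engine.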

Access Control #

Production access requires:

PagerDuty account for on-call alerts
AWS SSO configuration for cloud resources
Tailscale VPN for database access (when configured)
Doppler access for secrets management
Appropriate AWS roles and permissions
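
Before an on-call shift it can help to confirm that the command-line tools behind these access paths are installed. The sketch below is one possible preflight check. The tool list is an assumption based on the services named in these runbooks, and the check only verifies that each binary is on `PATH`; it does not verify credentials, roles, or VPN connectivity.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed CLI names; adjust to whatever your team actually installs.
REQUIRED_TOOLS = ("aws", "kubectl", "tailscale", "doppler")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required CLIs found on PATH")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.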

Communication #

During incidents:

Acknowledge PagerDuty alert immediately
Create thread in #eng-incident-prd Slack channel
Document all steps in incident thread
Update PagerDuty incident with progress notes
Use Datadog/CloudWatch dashboards to share context
Close PagerDuty incident when resolved
Schedule post-incident review within 48 hours
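
The checklist above can be mirrored in a small local log that timestamps each note and computes the review deadline. This is a minimal sketch and not an integration with PagerDuty or Slack: `IncidentLog` and its methods are hypothetical names, and it assumes the 48-hour window is measured from the moment the incident is resolved.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

REVIEW_WINDOW = timedelta(hours=48)  # from the checklist above


@dataclass
class IncidentLog:
    title: str
    notes: List[Tuple[datetime, str]] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def note(self, text: str, at: Optional[datetime] = None) -> None:
        # Record each step with a UTC timestamp so the thread can be rebuilt later.
        self.notes.append((at or datetime.now(timezone.utc), text))

    def resolve(self, at: Optional[datetime] = None) -> None:
        self.resolved_at = at or datetime.now(timezone.utc)

    def review_due(self) -> datetime:
        # Assumption: the 48 hours start when the incident is resolved.
        if self.resolved_at is None:
            raise ValueError("incident is not resolved yet")
        return self.resolved_at + REVIEW_WINDOW


if __name__ == "__main__":
    log = IncidentLog("Example incident")
    log.note("Acknowledged PagerDuty alert")
    log.resolve()
    print("Schedule review by:", log.review_due().isoformat())
```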

Related Documentation #

Infrastructure Overview - AWS architecture and components
Kubernetes Setup - EKS cluster configuration
Database Management - Schema and migration procedures
Development Workflow - How code reaches production

Need Help? #

If you’re dealing with a critical incident and these runbooks aren’t sufficient:

Escalate immediately - Contact Ken Kanai
Check status pages - Verify third-party service status (AWS, Datadog, Clerk, etc.)
Gather context - Collect logs, metrics, and error messages before escalating
Document everything - Keep notes of what you’ve tried and observed

Remember: It’s always better to ask for help than to make the situation worse. When in doubt, escalate early.