Operations Runbooks #
These runbooks are operational guides for troubleshooting and managing the BonsAI platform in production. They are written for both first-time operators and experienced team members responding to production issues.
When to Use These Runbooks #
- Production incidents - When alerts fire or users report issues
- Routine operations - Regular maintenance and monitoring tasks
- Troubleshooting - Investigating performance or functionality problems
- Emergency procedures - Critical issues requiring immediate action
- Knowledge transfer - Onboarding new team members to operations
Available Runbooks #
Monitoring & Observability #
- Monitoring & Alerting - Setup for PagerDuty, Datadog, Sentry, and CloudWatch, plus alert interpretation
- Log Aggregation & Search - Finding and analyzing logs with trace IDs across all services
- LLM Observability - Monitoring AI/ML operations, token usage, and LLM performance
Infrastructure Access #
- Kubernetes Debugging - kubectl commands, pod inspection, and cluster troubleshooting
- Database Access - Connecting to production databases safely via Tailscale and CloudBeaver
- Secrets Management - Using Doppler to view and update environment variables
Service Management #
- RabbitMQ Management - Queue monitoring, message inspection, and queue management
- Service Health Checks - Verifying system health across all components
Deployment & CI/CD #
- Deployment Monitoring - Tracking deployments via GitHub Actions and Kubernetes
Emergency Response #
- Incident Response - Step-by-step incident management with PagerDuty integration
General Principles #
Safety First #
- Test changes in a non-production environment first whenever possible
- Use read-only queries before making changes
- Take backups before destructive operations
- Document all production changes
- Get peer review for risky operations
Access Control #
Production access requires:
- PagerDuty account for on-call alerts
- AWS SSO configuration for cloud resources
- Tailscale VPN for database access (when configured)
- Doppler access for secrets management
- Appropriate AWS roles and permissions
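Before going on call, it can help to confirm that the command-line tools behind these requirements are installed locally. The following is a small sketch, not an official preflight script: it assumes the standard binary names `aws`, `tailscale`, and `doppler`, and it only checks that they are on PATH, not that you are logged in or hold the right roles.

```python
import shutil

# Hypothetical mapping of access requirement -> CLI binary expected on PATH.
# Binary names are the usual ones for each tool; whether your workflow
# needs all of them locally is an assumption.
REQUIRED_CLIS = {
    "AWS SSO": "aws",
    "Tailscale VPN": "tailscale",
    "Doppler": "doppler",
}


def missing_tools(required=REQUIRED_CLIS, which=shutil.which):
    """Return the requirement names whose CLI binary is not found on PATH."""
    return [name for name, binary in required.items() if which(binary) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing CLIs for:", ", ".join(missing))
    else:
        print("All expected CLIs found on PATH.")
```

The `which` parameter is injectable so the check can be exercised without touching the real PATH.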
Communication #
During incidents:
- Acknowledge the PagerDuty alert immediately
- Create a thread in the #eng-incident-prd Slack channel
- Document all steps in the incident thread
- Update the PagerDuty incident with progress notes
- Use Datadog/CloudWatch dashboards to share context
- Close the PagerDuty incident when it is resolved
- Schedule post-incident review within 48 hours
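To make the 48-hour window concrete, here is a minimal sketch that computes the latest review time from an incident's resolution time. It assumes the window is measured from the moment the incident is resolved, which the guideline does not state explicitly; the function name and the sample timestamp are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# The 48-hour figure comes from the guideline above. Measuring it from
# resolution time is an assumption.
REVIEW_WINDOW = timedelta(hours=48)


def review_deadline(resolved_at: datetime) -> datetime:
    """Return the latest time by which the post-incident review is due."""
    if resolved_at.tzinfo is None:
        # Naive timestamps are ambiguous across time zones, so reject them.
        raise ValueError("resolved_at must be timezone-aware")
    return resolved_at + REVIEW_WINDOW


resolved = datetime(2024, 3, 1, 14, 30, tzinfo=timezone.utc)  # example value
print(review_deadline(resolved).isoformat())  # prints 2024-03-03T14:30:00+00:00
```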
Related Documentation #
- Infrastructure Overview - AWS architecture and components
- Kubernetes Setup - EKS cluster configuration
- Database Management - Schema and migration procedures
- Development Workflow - How code reaches production
Need Help? #
If you’re dealing with a critical incident and these runbooks aren’t sufficient:
- Escalate immediately - Contact Ken Kanai
- Check status pages - Verify third-party service status (AWS, Datadog, Clerk, etc.)
- Gather context - Collect logs, metrics, and error messages to share when you escalate
- Document everything - Keep notes of what you’ve tried and observed
Remember: It’s always better to ask for help than to make the situation worse. When in doubt, escalate early.