# Monitoring & Alerting
This runbook covers how to use BonsAI’s monitoring and alerting systems to identify and diagnose production issues.
## When to Use This Runbook
- You received an alert notification (PagerDuty, email, Slack)
- Investigating reported performance issues
- Proactive health monitoring
- Post-incident analysis and metrics review
- Setting up new alerts
## Overview of Monitoring Stack
BonsAI uses a multi-layered monitoring approach:
| Tool | Purpose | Access |
|---|---|---|
| PagerDuty | Incident alerting and on-call management | tofu-bonsai.pagerduty.com |
| Datadog | Metrics, APM, logs, dashboards | datadoghq.com |
| Sentry | Error tracking and exception monitoring | Check webapp configuration |
| CloudWatch | AWS infrastructure metrics and logs | AWS Console |
| OpenTelemetry | Metric collection and forwarding | Deployed in EKS cluster |
## PagerDuty

### Primary Incident Alerting System
PagerDuty is our primary alerting system for production incidents. It ensures critical alerts reach the on-call engineer immediately.
**Access:**
- URL: https://tofu-bonsai.pagerduty.com
- Download the mobile app for on-call notifications
- Configure notification preferences (phone, SMS, push)
### Alert Routing

```
Critical Alert (SEV1)
        ↓
PagerDuty Incident Created
        ↓
┌──────────────┬──────────────┬──────────────┐
│  Phone Call  │     SMS      │  Push Notif  │
│ (immediate)  │ (immediate)  │ (immediate)  │
└──────────────┴──────────────┴──────────────┘
        ↓
On-call Engineer Acknowledges
        ↓
Alerts stop, incident tracked
```
### PagerDuty Integration

**Datadog → PagerDuty:**
- Critical alerts in Datadog trigger PagerDuty incidents
- Configured in Datadog monitor settings via PagerDuty notification handles (see the example below)
- Includes alert context and links
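The routing itself is driven by the monitor's notification message: adding a `@pagerduty-<ServiceName>` handle sends triggered alerts to the mapped PagerDuty service. As a rough illustration only (the monitor name, query, and `@pagerduty-BonsAPI` handle below are placeholders, not our real monitors), a monitor can be created through the Datadog API:

```bash
# Illustrative only: create a Datadog monitor that pages PagerDuty when it triggers.
# The name, query, and @pagerduty-BonsAPI handle are placeholders; real monitors are
# managed in Datadog monitor settings. Assumes DD_API_KEY / DD_APP_KEY are exported.
curl -X POST "https://api.us3.datadoghq.com/api/v1/monitor" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Example: bonsapi health",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{service:bonsapi} > 90",
    "message": "CPU is high on bonsapi. See the Monitoring & Alerting runbook. @pagerduty-BonsAPI"
  }'
```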
**When You Receive a PagerDuty Alert:**

1. **Review the incident**
   - Check incident details in PagerDuty app/web
   - Note severity, service, and description
   - Review attached metrics/graphs
2. **Acknowledge immediately**
   - Stops alert notifications
   - Marks you as incident commander
   - Starts incident timeline
3. **Investigate**
   - Click through to Datadog dashboard
   - Follow incident response runbook
   - See Incident Response
4. **Update PagerDuty**
   - Add notes throughout investigation
   - Update status (Acknowledged → Resolved)
   - Link to Slack thread or post-mortem
### On-Call Schedule

**View schedule:** check who's on call at https://tofu-bonsai.pagerduty.com/schedules (a CLI alternative is sketched after the list below).

**On-call responsibilities:**
- Respond to PagerDuty alerts within 5 minutes
- Acknowledge all incidents
- Coordinate incident response
- Document actions in PagerDuty
- Hand off unresolved incidents clearly
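If you just need to know who is on call right now from a terminal, the PagerDuty REST API exposes this. A minimal sketch, assuming a read-only PagerDuty API token is exported as `PAGERDUTY_API_TOKEN` (the variable name is an assumption) and `jq` is installed:

```bash
# Hedged sketch: list current on-calls via the PagerDuty REST API.
# PAGERDUTY_API_TOKEN is assumed to be a valid read-only token.
curl -s "https://api.pagerduty.com/oncalls" \
  -H "Authorization: Token token=${PAGERDUTY_API_TOKEN}" \
  -H "Content-Type: application/json" \
  | jq -r '.oncalls[] | select(.schedule != null)
           | "\(.schedule.summary): \(.user.summary) (until \(.end))"'
```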
## Datadog

### Access Setup

1. **Login to Datadog**
   - URL: https://datadoghq.com (US3 region)
   - Use your company email for SSO
2. **Key Dashboards**
   - Infrastructure Overview - EKS cluster health, pod metrics
   - Application Performance - API response times, throughput
   - RabbitMQ Queues - Message queue depth and processing rates
   - Database Performance - PostgreSQL query performance
### Understanding Datadog Integration

Datadog collects metrics and logs through multiple channels:

**OpenTelemetry Collector** (`deployment/resources/otel-collector/deployment.yaml`) sends metrics to Datadog:
- Datadog API endpoint: datadoghq.com (US3)
- Requires: `DATADOG_API_KEY` environment variable

**Webapp Logging** (`apps/webapp/src/shared/lib/logger/datadog/server/index.ts`) sends server-side logs to Datadog:
- Endpoint: `http-intake.logs.us3.datadoghq.com`
- Service: `bonsai-webapp`
- Source: `nextjs`
### Common Datadog Tasks

#### View Application Logs

- Navigate to **Logs** in the left sidebar
- Filter by service:

  ```
  service:bonsapi
  service:bonsai-webapp
  service:bonsai-invoice
  service:bonsai-knowledge
  ```

- Use the time range selector for the incident timeframe
- Add filters (also runnable via the Logs API; see the sketch below):

  ```
  status:error
  @http.status_code:>=500
  @user_id:<specific_user>
  ```
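The same filters can also be run outside the UI through the Datadog Logs Search API, which is handy for scripting quick checks during an incident. A sketch, assuming `DD_API_KEY` and `DD_APP_KEY` are available; the query and time range are examples:

```bash
# Hedged sketch: search logs via the Datadog Logs Search API (US3 site) instead of the UI.
# Assumes DD_API_KEY and DD_APP_KEY are exported; query and time range are examples.
curl -s -X POST "https://api.us3.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:bonsapi status:error",
      "from": "now-1h",
      "to": "now"
    },
    "page": { "limit": 25 }
  }' | jq '.data[].attributes.message'
```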
#### Check API Performance

- Go to **APM → Services**
- Select a service (e.g., `bonsapi`)
- Review metrics:
  - **Requests per second** - Traffic volume
  - **Avg latency** - Response time trends
  - **Error rate** - Failed request percentage
- Click individual traces to see detailed execution
#### Investigate Alerts

When an alert fires:

1. **Check the alert details**
   - Alert name and severity
   - Triggered condition
   - Affected resources
2. **View the metric graph**
   - Look for spikes or drops
   - Compare to historical baseline
   - Check for correlated metrics
3. **Correlation analysis**
   - Did a deployment happen? (check GitHub Actions)
   - Are multiple services affected? (infrastructure issue)
   - Is it user-specific? (data/auth issue)
#### Create Custom Queries

Use Datadog Query Language:

```
-- Find errors in the last hour
status:error service:bonsapi

-- High latency requests
@http.response_time:>5000 service:bonsapi

-- Failed RabbitMQ messages
service:bonsai-invoice @rabbitmq.status:failed

-- Database slow queries
service:bonsapi @db.statement_duration:>1000
```
## Sentry

### Error Tracking

Sentry captures JavaScript errors and exceptions from the webapp.

**Configuration Files:**
- `apps/webapp/sentry.edge.config.ts` - Edge runtime errors
- `apps/webapp/sentry.server.config.ts` - Server-side errors
### Accessing Sentry

1. Check the Sentry DSN in Doppler:

   ```bash
   doppler secrets get SENTRY_DSN --project bonsai --config <env>
   ```

2. Log into the Sentry dashboard (URL from DSN)
3. Filter by:
   - Environment (dev/prod)
   - Release version
   - Error type
   - User ID (if available)
### Common Sentry Tasks

#### Investigate Error Spike

1. **View error list**
   - Sort by frequency or recency
   - Group similar errors
2. **Analyze stack trace**
   - Identify the failing code path
   - Check source maps are loaded
   - Review breadcrumbs (user actions leading to the error)
3. **Check user impact**
   - How many users are affected?
   - Is it environment-specific?
   - Can it be reproduced?
## CloudWatch

### Accessing CloudWatch Logs

CloudWatch stores logs from:
- EKS pods - `/eks/{env}/pods`
- EKS control plane - `/aws/eks/bonsai-app-eks-cluster-{env}/cluster`

**Log Retention:**
- CloudWatch: 3 days (prod)
- S3 cold storage: 365 days (prod)
### Using CloudWatch

1. **AWS Console Access**

   ```bash
   # Login via SSO
   aws sso login --profile bonsai-prod
   ```

2. **Navigate to CloudWatch**
   - AWS Console → CloudWatch → Log groups
   - Select `/eks/prod/pods` for application logs
3. **Search logs**
   - Use CloudWatch Insights for structured queries
   - Filter by pod name or container (or search from the CLI; see the sketch below)
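For quick ad-hoc searches without opening the console, the AWS CLI can filter the same log groups. A sketch, assuming your SSO profile has read access; the stream prefix and filter pattern are examples:

```bash
# Hedged sketch: search application logs from the CLI instead of the console.
# The log group, stream prefix, and pattern are examples.
# (GNU date syntax; on macOS use `date -v-1H +%s` instead.)
aws logs filter-log-events \
  --profile bonsai-prod \
  --log-group-name /eks/prod/pods \
  --log-stream-name-prefix bonsapi \
  --filter-pattern "ERROR" \
  --start-time "$(date -d '1 hour ago' +%s)000" \
  --query 'events[].message' \
  --output text
```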
### CloudWatch Insights Queries

```
# Find errors in BonsAPI
fields @timestamp, @message
| filter @logStream like /bonsapi/
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# High latency requests
fields @timestamp, @message, responseTime
| filter responseTime > 5000
| stats avg(responseTime), max(responseTime) by bin(5m)

# Failed document processing
fields @timestamp, @message, documentId
| filter @message like /processing failed/
| parse @message "document=* error=*" as doc, err
| display @timestamp, doc, err
```
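Insights queries can also be started from the CLI, which is convenient for capturing results during a post-mortem. A sketch; the log group, profile, and time window are assumptions:

```bash
# Hedged sketch: run a Logs Insights query from the CLI and fetch the results.
# (GNU date syntax; on macOS use `date -v-1H +%s` instead.)
QUERY_ID=$(aws logs start-query \
  --profile bonsai-prod \
  --log-group-name /eks/prod/pods \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50' \
  --output text --query queryId)

# Results are asynchronous; re-run until the status is Complete.
sleep 5
aws logs get-query-results --profile bonsai-prod --query-id "$QUERY_ID"
```

`get-query-results` reports `Running` until the query finishes, so repeat the call if the result set comes back empty.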
### CloudWatch Metrics

Monitor AWS infrastructure:

1. **EKS Cluster Metrics**
   - CPU utilization
   - Memory usage
   - Pod count
   - Network I/O
2. **RDS/Database**
   - Connection count
   - CPU and memory
   - Disk I/O
   - Replication lag (if applicable)
3. **ElastiCache (Redis)**
   - Cache hit rate
   - Memory usage
   - Evicted keys
4. **Amazon MQ (RabbitMQ)**
   - Queue depth
   - Message rate
   - Connection count
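Any of these metrics can be pulled ad hoc with the CLI. For example, a sketch for spot-checking database CPU; the DB instance identifier is a placeholder, so look up the real one in the RDS console:

```bash
# Hedged sketch: spot-check RDS CPU over the last hour (5-minute averages).
# <db-instance-id> is a placeholder. (GNU date syntax; adjust on macOS.)
aws cloudwatch get-metric-statistics \
  --profile bonsai-prod \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=<db-instance-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average Maximum
```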
Alert Interpretation #
Common Alert Types #
#### High Error Rate

```
Alert: API Error Rate > 5%
Service: bonsapi
Current: 12%
```

Investigation steps:
- Check recent deployments in the last 1-2 hours (see the commands below)
- Review error logs in Datadog
- Check affected endpoints
- Verify database connectivity
- Check external service status (Clerk, Stripe, etc.)
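To quickly rule a deployment in or out, compare the rollout history and recent CI runs against the time the alert fired. A sketch; the deployment name and namespace are assumptions based on our naming, and the GitHub CLI must be authenticated against the repo:

```bash
# Hedged sketch: correlate the error spike with recent deployments.
# Deployment name/namespace are assumptions; adjust to the affected service.
kubectl rollout history deployment/bonsapi-deployment -n default

# Recent CI runs (requires an authenticated GitHub CLI in the repo).
gh run list --limit 10
```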
#### High Latency

```
Alert: API Response Time > 2s
Service: bonsapi
P95: 3.2s
```

Investigation steps:
- Check database query performance (see the sketch below)
- Review RabbitMQ queue depth
- Check Redis cache hit rate
- Analyze slow traces in Datadog APM
- Check for N+1 queries or missing indexes
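For the database angle, a look at the longest-running active queries often points at the culprit. A rough sketch, assuming `psql` and `DATABASE_URL` are available inside the `bonsapi` pod (both assumptions; otherwise run the same SQL through your usual database access path):

```bash
# Hedged sketch: list the longest-running queries on PostgreSQL.
# Assumes psql and DATABASE_URL exist inside the bonsapi pod; adapt to your DB access path.
kubectl exec deploy/bonsapi-deployment -n default -- sh -c \
  'psql "$DATABASE_URL" -c "SELECT pid, state, now() - query_start AS duration, left(query, 80) AS query FROM pg_stat_activity ORDER BY duration DESC NULLS LAST LIMIT 10;"'
```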
#### Pod Restart Loop

```
Alert: Pod Restarting Frequently
Pod: bonsapi-deployment-xyz123
Restarts: 5 in 10 minutes
```

Investigation steps:
- Check pod logs: `kubectl logs <pod-name> -n default --previous`
- Review resource limits: `kubectl describe pod <pod-name>`
- Check for OOM kills: look for `OOMKilled` status (see the sketch below)
- Verify database connectivity
- Check secret availability
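To confirm an OOM kill specifically, the container's last terminated state records the reason, and the pod's events show the surrounding context. A sketch:

```bash
# Hedged sketch: confirm whether the last container termination was an OOM kill.
kubectl get pod <pod-name> -n default \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# Recent events for the pod (scheduling, probes, OOM, image pulls).
kubectl get events -n default --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp
```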
#### Queue Backlog

```
Alert: RabbitMQ Queue Depth > 1000
Queue: invoice.processing
Depth: 2,500 messages
```

Investigation steps:
- Check consumer pod status
- Review consumer logs for errors
- Check processing rate vs arrival rate
- Scale consumers if needed (see the sketch below and the RabbitMQ Management runbook)
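If messages are simply arriving faster than they are consumed, scaling the consumer deployment is usually the fastest mitigation. A sketch; the deployment name, label, and replica count are assumptions, and the RabbitMQ Management runbook has the full procedure:

```bash
# Hedged sketch: check current consumer replicas, then scale them up.
# "bonsai-invoice-deployment" and the replica count are placeholders.
kubectl get deployment bonsai-invoice-deployment -n default
kubectl scale deployment bonsai-invoice-deployment -n default --replicas=4

# Watch the queue drain (consumer logs should show messages being processed).
# The label selector is an assumption; check the deployment's labels first.
kubectl logs -n default -l app=bonsai-invoice --tail=50 -f
```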
## Setting Up Monitoring for New Services

When adding a new service:

1. **Add Datadog logging**

   ```yaml
   # In deployment.yaml
   env:
     - name: DATADOG_API_KEY
       valueFrom:
         secretKeyRef:
           name: bonsai-secret
           key: DATADOG_API_KEY
   ```

2. **Tag metrics properly**
   - Service name
   - Environment (dev/prod)
   - Version/release tag
3. **Create dashboard**
   - Clone existing service dashboard
   - Update service name filter
   - Add service-specific metrics
4. **Configure alerts**
   - Error rate threshold
   - Latency threshold
   - Resource usage
## Troubleshooting Monitoring Issues

### Missing Metrics

**Problem:** Metrics not appearing in Datadog

**Solutions:**

1. Verify `DATADOG_API_KEY` is set correctly:

   ```bash
   kubectl get secret bonsai-secret -o jsonpath='{.data.DATADOG_API_KEY}' | base64 -d
   ```

2. Check OpenTelemetry collector status:

   ```bash
   kubectl get pods -n aws-observability
   kubectl logs -n aws-observability -l app.kubernetes.io/name=otel-collector
   ```

3. Verify network connectivity from pods to the Datadog endpoint (see the sketch below)
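A throwaway curl pod is a quick way to test connectivity from inside the cluster. A sketch; the US3 API hostname is the documented one, but confirm it matches the endpoint configured in the otel-collector:

```bash
# Hedged sketch: verify pods can reach the Datadog API endpoint (US3 site).
# A 403 without an API key still proves network reachability.
kubectl run dd-connectivity-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sv https://api.us3.datadoghq.com/api/v1/validate
```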
### CloudWatch Logs Not Appearing

**Problem:** Pod logs not showing in CloudWatch

**Solutions:**

1. Check Fluent Bit configuration:

   ```bash
   kubectl get configmap aws-logging -n aws-observability -o yaml
   ```

2. Verify IAM permissions for log publishing
3. Check Fluent Bit pods:

   ```bash
   kubectl get pods -n aws-observability
   ```
### High Datadog Costs

**Problem:** Excessive log/metric volume

**Solutions:**
- Review log sampling rates
- Filter noisy logs at source
- Adjust metric collection intervals
- Use log exclusion filters for debug logs in production
## Best Practices

### Effective Alerting
- Alert on symptoms, not causes - Alert when users are impacted
- Set appropriate thresholds - Avoid alert fatigue
- Include runbook links - Help responders quickly take action
- Review and tune - Regularly review false positives
### Dashboard Design
- Start with key metrics - RED method (Rate, Errors, Duration)
- Add context - Related metrics, deployment markers
- Use templates - Consistent across services
- Keep it simple - Avoid information overload
### Log Management
- Structure your logs - Use JSON for easier parsing
- Include context - Request IDs, user IDs, trace IDs
- Control volume - Use appropriate log levels
- Sensitive data - Never log passwords, tokens, PII
## See Also
- Incident Response - Complete PagerDuty incident workflow
- Kubernetes Debugging - Pod and cluster troubleshooting
- Log Management - Advanced log searching and analysis
- LLM Observability - Monitoring AI/ML operations
- Infrastructure Overview - System architecture