# Monitoring & Alerting
This runbook covers how to use BonsAI’s monitoring and alerting systems to identify and diagnose production issues.
## When to Use This Runbook
- You received an alert notification (PagerDuty, email, Slack)
- Investigating reported performance issues
- Proactive health monitoring
- Post-incident analysis and metrics review
- Setting up new alerts
## Overview of Monitoring Stack
BonsAI uses a multi-layered monitoring approach:
| Tool | Purpose | Access |
|---|---|---|
| PagerDuty | Incident alerting and on-call management | tofu-bonsai.pagerduty.com |
| Datadog | Metrics, APM, logs, dashboards | datadoghq.com |
| Sentry | Error tracking and exception monitoring | Check webapp configuration |
| CloudWatch | AWS infrastructure metrics and logs | AWS Console |
| OpenTelemetry | Metric collection and forwarding | Deployed in EKS cluster |
## PagerDuty

### Primary Incident Alerting System
PagerDuty is our primary alerting system for production incidents. It ensures critical alerts reach the on-call engineer immediately.
**Access:**
- URL: https://tofu-bonsai.pagerduty.com
- Download the mobile app for on-call notifications
- Configure notification preferences (phone, SMS, push)
### Alert Routing

```
Critical Alert (SEV1)
        ↓
PagerDuty Incident Created
        ↓
┌──────────────┬──────────────┬──────────────┐
│  Phone Call  │     SMS      │  Push Notif  │
│ (immediate)  │ (immediate)  │ (immediate)  │
└──────────────┴──────────────┴──────────────┘
        ↓
On-call Engineer Acknowledges
        ↓
Alerts stop, incident tracked
```
### PagerDuty Integration

**Datadog → PagerDuty:**
- Critical alerts in Datadog trigger PagerDuty incidents
- Configured in Datadog monitor settings via PagerDuty notification handles (see the example below)
- Includes alert context and links
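The routing itself is driven by the monitor's notification message: adding a `@pagerduty-<ServiceName>` handle sends triggered alerts to the mapped PagerDuty service. As a rough illustration only (the monitor name, query, and `@pagerduty-BonsAPI` handle below are placeholders, not our real monitors), a monitor can be created through the Datadog API:

```bash
# Illustrative only: create a Datadog monitor that pages PagerDuty when it triggers.
# The name, query, and @pagerduty-BonsAPI handle are placeholders; real monitors are
# managed in Datadog monitor settings. Assumes DD_API_KEY / DD_APP_KEY are exported.
curl -X POST "https://api.us3.datadoghq.com/api/v1/monitor" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Example: bonsapi health",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{service:bonsapi} > 90",
    "message": "CPU is high on bonsapi. See the Monitoring & Alerting runbook. @pagerduty-BonsAPI"
  }'
```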
**When You Receive a PagerDuty Alert:**

1. **Review the incident**
   - Check incident details in PagerDuty app/web
   - Note severity, service, and description
   - Review attached metrics/graphs
2. **Acknowledge immediately**
   - Stops alert notifications
   - Marks you as incident commander
   - Starts incident timeline
3. **Investigate**
   - Click through to Datadog dashboard
   - Follow incident response runbook
   - See Incident Response
4. **Update PagerDuty**
   - Add notes throughout investigation
   - Update status (Acknowledged → Resolved)
   - Link to Slack thread or post-mortem
### On-Call Schedule

**View schedule:** check who's on call at https://tofu-bonsai.pagerduty.com/schedules (a CLI alternative is sketched after the list below).

**On-call responsibilities:**
- Respond to PagerDuty alerts within 5 minutes
- Acknowledge all incidents
- Coordinate incident response
- Document actions in PagerDuty
- Hand off unresolved incidents clearly
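If you just need to know who is on call right now from a terminal, the PagerDuty REST API exposes this. A minimal sketch, assuming a read-only PagerDuty API token is exported as `PAGERDUTY_API_TOKEN` (the variable name is an assumption) and `jq` is installed:

```bash
# Hedged sketch: list current on-calls via the PagerDuty REST API.
# PAGERDUTY_API_TOKEN is assumed to be a valid read-only token.
curl -s "https://api.pagerduty.com/oncalls" \
  -H "Authorization: Token token=${PAGERDUTY_API_TOKEN}" \
  -H "Content-Type: application/json" \
  | jq -r '.oncalls[] | select(.schedule != null)
           | "\(.schedule.summary): \(.user.summary) (until \(.end))"'
```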
## Datadog

### Access Setup

1. **Login to Datadog**
   - URL: https://datadoghq.com (US3 region)
   - Use your company email for SSO
2. **Key Dashboards**
   - Infrastructure Overview - EKS cluster health, pod metrics
   - Application Performance - API response times, throughput
   - RabbitMQ Queues - Message queue depth and processing rates
   - Database Performance - PostgreSQL query performance
### Understanding Datadog Integration

Datadog collects metrics and logs through multiple channels:

**OpenTelemetry Collector** (`deployment/resources/otel-collector/deployment.yaml`) sends metrics to Datadog:
- Datadog API endpoint: datadoghq.com (US3)
- Requires: `DATADOG_API_KEY` environment variable

**Webapp Logging** (`apps/webapp/src/shared/lib/logger/datadog/server/index.ts`) sends server-side logs to Datadog:
- Endpoint: `http-intake.logs.us3.datadoghq.com`
- Service: `bonsai-webapp`
- Source: `nextjs`
### Common Datadog Tasks

#### View Application Logs

- Navigate to **Logs** in the left sidebar
- Filter by service:

  ```
  service:bonsapi
  service:bonsai-webapp
  service:bonsai-invoice
  service:bonsai-knowledge
  ```

- Use the time range selector for the incident timeframe
- Add filters (also runnable via the Logs API; see the sketch below):

  ```
  status:error
  @http.status_code:>=500
  @user_id:<specific_user>
  ```
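The same filters can also be run outside the UI through the Datadog Logs Search API, which is handy for scripting quick checks during an incident. A sketch, assuming `DD_API_KEY` and `DD_APP_KEY` are available; the query and time range are examples:

```bash
# Hedged sketch: search logs via the Datadog Logs Search API (US3 site) instead of the UI.
# Assumes DD_API_KEY and DD_APP_KEY are exported; query and time range are examples.
curl -s -X POST "https://api.us3.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "filter": {
      "query": "service:bonsapi status:error",
      "from": "now-1h",
      "to": "now"
    },
    "page": { "limit": 25 }
  }' | jq '.data[].attributes.message'
```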
#### Check API Performance

- Go to **APM → Services**
- Select a service (e.g., `bonsapi`)
- Review metrics:
  - **Requests per second** - Traffic volume
  - **Avg latency** - Response time trends
  - **Error rate** - Failed request percentage
- Click individual traces to see detailed execution
#### Investigate Alerts

When an alert fires:

1. **Check the alert details**
   - Alert name and severity
   - Triggered condition
   - Affected resources
2. **View the metric graph**
   - Look for spikes or drops
   - Compare to historical baseline
   - Check for correlated metrics
3. **Correlation analysis**
   - Did a deployment happen? (check GitHub Actions)
   - Are multiple services affected? (infrastructure issue)
   - Is it user-specific? (data/auth issue)
#### Create Custom Queries

Use Datadog Query Language:

```
-- Find errors in the last hour
status:error service:bonsapi

-- High latency requests
@http.response_time:>5000 service:bonsapi

-- Failed RabbitMQ messages
service:bonsai-invoice @rabbitmq.status:failed

-- Database slow queries
service:bonsapi @db.statement_duration:>1000
```
## Sentry

### Error Tracking

Sentry captures JavaScript errors and exceptions from the webapp.

**Configuration Files:**
- `apps/webapp/sentry.edge.config.ts` - Edge runtime errors
- `apps/webapp/sentry.server.config.ts` - Server-side errors
### Accessing Sentry

1. Check the Sentry DSN in Doppler:

   ```bash
   doppler secrets get SENTRY_DSN --project bonsai --config <env>
   ```

2. Log into the Sentry dashboard (URL from DSN)
3. Filter by:
   - Environment (dev/prod)
   - Release version
   - Error type
   - User ID (if available)
### Common Sentry Tasks

#### Investigate Error Spike

1. **View error list**
   - Sort by frequency or recency
   - Group similar errors
2. **Analyze stack trace**
   - Identify the failing code path
   - Check source maps are loaded
   - Review breadcrumbs (user actions leading to the error)
3. **Check user impact**
   - How many users are affected?
   - Is it environment-specific?
   - Can it be reproduced?
## CloudWatch

### Accessing CloudWatch Logs

CloudWatch stores logs from:
- EKS pods - `/eks/{env}/pods`
- EKS control plane - `/aws/eks/bonsai-app-eks-cluster-{env}/cluster`

**Log Retention:**
- CloudWatch: 3 days (prod)
- S3 cold storage: 365 days (prod)
### Using CloudWatch

1. **AWS Console Access**

   ```bash
   # Login via SSO
   aws sso login --profile bonsai-prod
   ```

2. **Navigate to CloudWatch**
   - AWS Console → CloudWatch → Log groups
   - Select `/eks/prod/pods` for application logs
3. **Search logs**
   - Use CloudWatch Insights for structured queries
   - Filter by pod name or container (or search from the CLI; see the sketch below)
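For quick ad-hoc searches without opening the console, the AWS CLI can filter the same log groups. A sketch, assuming your SSO profile has read access; the stream prefix and filter pattern are examples:

```bash
# Hedged sketch: search application logs from the CLI instead of the console.
# The log group, stream prefix, and pattern are examples.
# (GNU date syntax; on macOS use `date -v-1H +%s` instead.)
aws logs filter-log-events \
  --profile bonsai-prod \
  --log-group-name /eks/prod/pods \
  --log-stream-name-prefix bonsapi \
  --filter-pattern "ERROR" \
  --start-time "$(date -d '1 hour ago' +%s)000" \
  --query 'events[].message' \
  --output text
```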
### CloudWatch Insights Queries

```
# Find errors in BonsAPI
fields @timestamp, @message
| filter @logStream like /bonsapi/
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# High latency requests
fields @timestamp, @message, responseTime
| filter responseTime > 5000
| stats avg(responseTime), max(responseTime) by bin(5m)

# Failed document processing
fields @timestamp, @message, documentId
| filter @message like /processing failed/
| parse @message "document=* error=*" as doc, err
| display @timestamp, doc, err
```
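Insights queries can also be started from the CLI, which is convenient for capturing results during a post-mortem. A sketch; the log group, profile, and time window are assumptions:

```bash
# Hedged sketch: run a Logs Insights query from the CLI and fetch the results.
# (GNU date syntax; on macOS use `date -v-1H +%s` instead.)
QUERY_ID=$(aws logs start-query \
  --profile bonsai-prod \
  --log-group-name /eks/prod/pods \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50' \
  --output text --query queryId)

# Results are asynchronous; re-run until the status is Complete.
sleep 5
aws logs get-query-results --profile bonsai-prod --query-id "$QUERY_ID"
```

`get-query-results` reports `Running` until the query finishes, so repeat the call if the result set comes back empty.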
### CloudWatch Metrics

Monitor AWS infrastructure:

1. **EKS Cluster Metrics**
   - CPU utilization
   - Memory usage
   - Pod count
   - Network I/O
2. **RDS/Database**
   - Connection count
   - CPU and memory
   - Disk I/O
   - Replication lag (if applicable)
3. **ElastiCache (Redis)**
   - Cache hit rate
   - Memory usage
   - Evicted keys
4. **Amazon MQ (RabbitMQ)**
   - Queue depth
   - Message rate
   - Connection count
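Any of these metrics can be pulled ad hoc with the CLI. For example, a sketch for spot-checking database CPU; the DB instance identifier is a placeholder, so look up the real one in the RDS console:

```bash
# Hedged sketch: spot-check RDS CPU over the last hour (5-minute averages).
# <db-instance-id> is a placeholder. (GNU date syntax; adjust on macOS.)
aws cloudwatch get-metric-statistics \
  --profile bonsai-prod \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=<db-instance-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average Maximum
```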
Alert Interpretation #
Common Alert Types #
#### High Error Rate

```
Alert: API Error Rate > 5%
Service: bonsapi
Current: 12%
```

Investigation steps:
- Check recent deployments in the last 1-2 hours (see the commands below)
- Review error logs in Datadog
- Check affected endpoints
- Verify database connectivity
- Check external service status (Clerk, Stripe, etc.)
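To quickly rule a deployment in or out, compare the rollout history and recent CI runs against the time the alert fired. A sketch; the deployment name and namespace are assumptions based on our naming, and the GitHub CLI must be authenticated against the repo:

```bash
# Hedged sketch: correlate the error spike with recent deployments.
# Deployment name/namespace are assumptions; adjust to the affected service.
kubectl rollout history deployment/bonsapi-deployment -n default

# Recent CI runs (requires an authenticated GitHub CLI in the repo).
gh run list --limit 10
```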
#### High Latency

```
Alert: API Response Time > 2s
Service: bonsapi
P95: 3.2s
```

Investigation steps:
- Check database query performance (see the sketch below)
- Review RabbitMQ queue depth
- Check Redis cache hit rate
- Analyze slow traces in Datadog APM
- Check for N+1 queries or missing indexes
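For the database angle, a look at the longest-running active queries often points at the culprit. A rough sketch, assuming `psql` and `DATABASE_URL` are available inside the `bonsapi` pod (both assumptions; otherwise run the same SQL through your usual database access path):

```bash
# Hedged sketch: list the longest-running queries on PostgreSQL.
# Assumes psql and DATABASE_URL exist inside the bonsapi pod; adapt to your DB access path.
kubectl exec deploy/bonsapi-deployment -n default -- sh -c \
  'psql "$DATABASE_URL" -c "SELECT pid, state, now() - query_start AS duration, left(query, 80) AS query FROM pg_stat_activity ORDER BY duration DESC NULLS LAST LIMIT 10;"'
```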
#### Pod Restart Loop

```
Alert: Pod Restarting Frequently
Pod: bonsapi-deployment-xyz123
Restarts: 5 in 10 minutes
```

Investigation steps:
- Check pod logs: `kubectl logs <pod-name> -n default --previous`
- Review resource limits: `kubectl describe pod <pod-name>`
- Check for OOM kills: look for `OOMKilled` status (see the sketch below)
- Verify database connectivity
- Check secret availability
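To confirm an OOM kill specifically, the container's last terminated state records the reason, and the pod's events show the surrounding context. A sketch:

```bash
# Hedged sketch: confirm whether the last container termination was an OOM kill.
kubectl get pod <pod-name> -n default \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

# Recent events for the pod (scheduling, probes, OOM, image pulls).
kubectl get events -n default --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp
```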
#### Queue Backlog

```
Alert: RabbitMQ Queue Depth > 1000
Queue: invoice.processing
Depth: 2,500 messages
```

Investigation steps:
- Check consumer pod status
- Review consumer logs for errors
- Check processing rate vs arrival rate
- Scale consumers if needed (see the sketch below and the RabbitMQ Management runbook)
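If messages are simply arriving faster than they are consumed, scaling the consumer deployment is usually the fastest mitigation. A sketch; the deployment name, label, and replica count are assumptions, and the RabbitMQ Management runbook has the full procedure:

```bash
# Hedged sketch: check current consumer replicas, then scale them up.
# "bonsai-invoice-deployment" and the replica count are placeholders.
kubectl get deployment bonsai-invoice-deployment -n default
kubectl scale deployment bonsai-invoice-deployment -n default --replicas=4

# Watch the queue drain (consumer logs should show messages being processed).
# The label selector is an assumption; check the deployment's labels first.
kubectl logs -n default -l app=bonsai-invoice --tail=50 -f
```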
## Setting Up Monitoring for New Services

When adding a new service:

1. **Add Datadog logging**

   ```yaml
   # In deployment.yaml
   env:
     - name: DATADOG_API_KEY
       valueFrom:
         secretKeyRef:
           name: bonsai-secret
           key: DATADOG_API_KEY
   ```

2. **Tag metrics properly**
   - Service name
   - Environment (dev/prod)
   - Version/release tag
3. **Create dashboard**
   - Clone existing service dashboard
   - Update service name filter
   - Add service-specific metrics
4. **Configure alerts**
   - Error rate threshold
   - Latency threshold
   - Resource usage
## Troubleshooting Monitoring Issues

### Missing Metrics

**Problem:** Metrics not appearing in Datadog

**Solutions:**

1. Verify `DATADOG_API_KEY` is set correctly:

   ```bash
   kubectl get secret bonsai-secret -o jsonpath='{.data.DATADOG_API_KEY}' | base64 -d
   ```

2. Check OpenTelemetry collector status:

   ```bash
   kubectl get pods -n aws-observability
   kubectl logs -n aws-observability -l app.kubernetes.io/name=otel-collector
   ```

3. Verify network connectivity from pods to the Datadog endpoint (see the sketch below)
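A throwaway curl pod is a quick way to test connectivity from inside the cluster. A sketch; the US3 API hostname is the documented one, but confirm it matches the endpoint configured in the otel-collector:

```bash
# Hedged sketch: verify pods can reach the Datadog API endpoint (US3 site).
# A 403 without an API key still proves network reachability.
kubectl run dd-connectivity-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sv https://api.us3.datadoghq.com/api/v1/validate
```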
### CloudWatch Logs Not Appearing

**Problem:** Pod logs not showing in CloudWatch

**Solutions:**

1. Check Fluent Bit configuration:

   ```bash
   kubectl get configmap aws-logging -n aws-observability -o yaml
   ```

2. Verify IAM permissions for log publishing
3. Check Fluent Bit pods:

   ```bash
   kubectl get pods -n aws-observability
   ```
### High Datadog Costs

**Problem:** Excessive log/metric volume

**Solutions:**
- Review log sampling rates
- Filter noisy logs at source
- Adjust metric collection intervals
- Use log exclusion filters for debug logs in production
## Best Practices

### Effective Alerting
- Alert on symptoms, not causes - Alert when users are impacted
- Set appropriate thresholds - Avoid alert fatigue
- Include runbook links - Help responders quickly take action
- Review and tune - Regularly review false positives
### Dashboard Design
- Start with key metrics - RED method (Rate, Errors, Duration)
- Add context - Related metrics, deployment markers
- Use templates - Consistent across services
- Keep it simple - Avoid information overload
### Log Management
- Structure your logs - Use JSON for easier parsing
- Include context - Request IDs, user IDs, trace IDs
- Control volume - Use appropriate log levels
- Sensitive data - Never log passwords, tokens, PII
## See Also
- Incident Response - Complete PagerDuty incident workflow
- Kubernetes Debugging - Pod and cluster troubleshooting
- Log Management - Advanced log searching and analysis
- LLM Observability - Monitoring AI/ML operations
- Infrastructure Overview - System architecture