Monitoring & Alerting #

This runbook covers how to use BonsAI’s monitoring and alerting systems to identify and diagnose production issues.

When to Use This Runbook #

  • You received an alert notification (PagerDuty, email, Slack)
  • You are investigating a reported performance issue
  • You are doing proactive health monitoring
  • You are reviewing metrics as part of post-incident analysis
  • You are setting up new alerts

Overview of Monitoring Stack #

BonsAI uses a multi-layered monitoring approach:

Tool            Purpose                                     Access
PagerDuty       Incident alerting and on-call management    tofu-bonsai.pagerduty.com
Datadog         Metrics, APM, logs, dashboards              datadoghq.com
Sentry          Error tracking and exception monitoring     Check webapp configuration
CloudWatch      AWS infrastructure metrics and logs         AWS Console
OpenTelemetry   Metric collection and forwarding            Deployed in EKS cluster

PagerDuty #

Primary Incident Alerting System #

PagerDuty is our primary alerting system for production incidents. It ensures critical alerts reach the on-call engineer immediately.

Access:

  • URL: https://tofu-bonsai.pagerduty.com
  • Download mobile app for on-call notifications
  • Configure notification preferences (phone, SMS, push)

Alert Routing #

Critical Alert (SEV1)
    ↓
PagerDuty Incident Created
    ↓
┌──────────────┬──────────────┬──────────────┐
│ Phone Call   │  SMS         │  Push Notif  │
│ (immediate)  │  (immediate) │  (immediate) │
└──────────────┴──────────────┴──────────────┘
    ↓
On-call Engineer Acknowledges
    ↓
Alerts stop, incident tracked

PagerDuty Integration #

Datadog → PagerDuty:

  • Critical alerts in Datadog trigger PagerDuty incidents
  • Configured in Datadog monitor settings
  • Includes alert context and links

When You Receive a PagerDuty Alert:

  1. Review the incident

    • Check incident details in PagerDuty app/web
    • Note severity, service, and description
    • Review attached metrics/graphs
  2. Acknowledge immediately

    • Stops alert notifications
    • Marks you as incident commander
    • Starts incident timeline
  3. Investigate

    • Click through to Datadog dashboard
    • Follow incident response runbook
    • See Incident Response
  4. Update PagerDuty

    • Add notes throughout investigation
    • Update status (Acknowledged → Resolved)
    • Link to Slack thread or post-mortem

On-Call Schedule #

View schedule:

# Check who's on call
Visit: https://tofu-bonsai.pagerduty.com/schedules
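
If you need this from a terminal instead, the PagerDuty REST API exposes the current on-call list. A minimal sketch, assuming a personal REST API token is available as PAGERDUTY_API_TOKEN:

# List who is currently on call (requires a PagerDuty REST API token)
curl --silent \
  --header "Authorization: Token token=$PAGERDUTY_API_TOKEN" \
  --header "Accept: application/vnd.pagerduty+json;version=2" \
  "https://api.pagerduty.com/oncalls" \
  | jq '.oncalls[] | {user: .user.summary, schedule: .schedule.summary, until: .end}'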

On-call responsibilities:

  • Respond to PagerDuty alerts within 5 minutes
  • Acknowledge all incidents
  • Coordinate incident response
  • Document actions in PagerDuty
  • Hand off unresolved incidents clearly

Datadog #

Access Setup #

  1. Login to Datadog

    • URL: https://us3.datadoghq.com (US3 region)
    • Use your company email for SSO
  2. Key Dashboards

    • Infrastructure Overview - EKS cluster health, pod metrics
    • Application Performance - API response times, throughput
    • RabbitMQ Queues - Message queue depth and processing rates
    • Database Performance - PostgreSQL query performance

Understanding Datadog Integration #

Datadog collects metrics through multiple channels:

# OpenTelemetry Collector sends metrics to Datadog
# Location: deployment/resources/otel-collector/deployment.yaml
Exporters:
  - Datadog API endpoint: datadoghq.com (US3)
  - Requires: DATADOG_API_KEY environment variable

Webapp Logging:

// Server-side logs sent to Datadog
// Location: apps/webapp/src/shared/lib/logger/datadog/server/index.ts
Endpoint: http-intake.logs.us3.datadoghq.com
Service: bonsai-webapp
Source: nextjs
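
To confirm the API key and the US3 intake are working end to end, both can be exercised from any shell that has the key. A sketch, assuming DATADOG_API_KEY holds the same key the collector and webapp use:

# Validate the API key against the US3 control plane
curl --silent -H "DD-API-KEY: ${DATADOG_API_KEY}" \
  "https://api.us3.datadoghq.com/api/v1/validate"

# Send a single test log to the US3 log intake used by the webapp
curl --silent -X POST "https://http-intake.logs.us3.datadoghq.com/api/v2/logs" \
  -H "DD-API-KEY: ${DATADOG_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '[{"ddsource": "nextjs", "service": "bonsai-webapp", "message": "datadog intake connectivity test"}]'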

Common Datadog Tasks #

View Application Logs #

  1. Navigate to Logs in left sidebar
  2. Filter by service:
    service:bonsapi
    service:bonsai-webapp
    service:bonsai-invoice
    service:bonsai-knowledge
    
  3. Use time range selector for incident timeframe
  4. Add filters:
    status:error
    @http.status_code:>=500
    @user_id:<specific_user>
    

Check API Performance #

  1. Go to APM → Services
  2. Select service (e.g., bonsapi)
  3. Review metrics:
    • Requests per second - Traffic volume
    • Avg latency - Response time trends
    • Error rate - Failed request percentage
  4. Click individual traces to see detailed execution
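
The same numbers can be pulled programmatically through the timeseries query API. A sketch, assuming an application key is exported as DATADOG_APP_KEY; the APM metric name used here (trace.http.request.hits) depends on how the service is instrumented, so treat it as a placeholder:

# Request rate for bonsapi over the last hour (metric name is a placeholder)
now=$(date +%s); from=$((now - 3600))
curl --silent -G "https://api.us3.datadoghq.com/api/v1/query" \
  -H "DD-API-KEY: ${DATADOG_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DATADOG_APP_KEY}" \
  --data-urlencode "from=${from}" \
  --data-urlencode "to=${now}" \
  --data-urlencode "query=sum:trace.http.request.hits{service:bonsapi}.as_rate()"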

Investigate Alerts #

When an alert fires:

  1. Check the alert details

    • Alert name and severity
    • Triggered condition
    • Affected resources
  2. View the metric graph

    • Look for spikes or drops
    • Compare to historical baseline
    • Check for correlated metrics
  3. Correlation analysis

    • Did a deployment happen? (check GitHub Actions; see the example after this list)
    • Are multiple services affected? (infrastructure issue)
    • Is it user-specific? (data/auth issue)
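
For the deployment question in the correlation step, the GitHub CLI lists recent workflow runs without leaving the terminal. A sketch; the repository and workflow names are placeholders:

# Recent workflow runs around the alert window (repo and workflow names are placeholders)
gh run list --repo <org>/<repo> --limit 15
gh run list --repo <org>/<repo> --workflow <deploy-workflow-file> --limit 5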

Create Custom Queries #

Use the Datadog log search syntax:

# Find errors in the last hour
status:error service:bonsapi

# High latency requests
@http.response_time:>5000 service:bonsapi

# Failed RabbitMQ messages
service:bonsai-invoice @rabbitmq.status:failed

# Database slow queries
service:bonsapi @db.statement_duration:>1000

Sentry #

Error Tracking #

Sentry captures JavaScript errors and exceptions from the webapp.

Configuration Files:

  • apps/webapp/sentry.edge.config.ts - Edge runtime errors
  • apps/webapp/sentry.server.config.ts - Server-side errors

Accessing Sentry #

  1. Check Sentry DSN in Doppler:

    doppler secrets get SENTRY_DSN --project bonsai --config <env>
    
  2. Log into Sentry dashboard (URL from DSN)

  3. Filter by:

    • Environment (dev/prod)
    • Release version
    • Error type
    • User ID (if available)

Common Sentry Tasks #

Investigate Error Spike #

  1. View error list

    • Sort by frequency or recency
    • Group similar errors
  2. Analyze stack trace

    • Identify the failing code path
    • Check source maps are loaded
    • Review breadcrumbs (user actions leading to error)
  3. Check user impact

    • How many users affected?
    • Is it environment-specific?
    • Can it be reproduced?
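
If the dashboard is slow during an incident, the same issue list is available from the Sentry web API. A sketch, assuming a sentry.io install, an auth token with read access, and placeholder organization/project slugs:

# Most frequent unresolved issues in the last 24h (slugs are placeholders)
curl --silent \
  -H "Authorization: Bearer ${SENTRY_AUTH_TOKEN}" \
  "https://sentry.io/api/0/projects/<org-slug>/<project-slug>/issues/?query=is:unresolved&statsPeriod=24h&sort=freq" \
  | jq '.[] | {title, count, lastSeen}'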

CloudWatch #

Accessing CloudWatch Logs #

CloudWatch stores logs from:

  • EKS pods - /eks/{env}/pods
  • EKS control plane - /aws/eks/bonsai-app-eks-cluster-{env}/cluster

Log Retention:

  • CloudWatch: 3 days (prod)
  • S3 cold storage: 365 days (prod)

Using CloudWatch #

  1. AWS Console Access

    # Login via SSO
    aws sso login --profile bonsai-prod
    
  2. Navigate to CloudWatch

    • AWS Console → CloudWatch → Log groups
    • Select /eks/prod/pods for application logs
  3. Search logs

    • Use CloudWatch Insights for structured queries
    • Filter by pod name or container
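
For quick checks, AWS CLI v2 can tail the same log group directly. A sketch using the prod profile and the log group named above:

# Live-tail application logs for errors (requires AWS CLI v2 and an active SSO session)
aws logs tail /eks/prod/pods \
  --profile bonsai-prod \
  --follow \
  --since 1h \
  --filter-pattern "ERROR"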

CloudWatch Insights Queries #

# Find errors in BonsAPI
fields @timestamp, @message
| filter @logStream like /bonsapi/
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# High latency requests
fields @timestamp, @message, responseTime
| filter responseTime > 5000
| stats avg(responseTime), max(responseTime) by bin(5m)

# Failed document processing
fields @timestamp, @message, documentId
| filter @message like /processing failed/
| parse @message "document=* error=*" as doc, err
| display @timestamp, doc, err
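
These queries can also be run from the CLI when the console is unavailable. A minimal sketch that runs the first query above (the date invocation is GNU syntax; on macOS use date -v-1H +%s):

# Start an Insights query and fetch its results
query_id=$(aws logs start-query \
  --profile bonsai-prod \
  --log-group-name /eks/prod/pods \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @logStream like /bonsapi/ | filter @message like /ERROR/ | sort @timestamp desc | limit 100' \
  --query 'queryId' --output text)

# Results are available once the query status is "Complete"
aws logs get-query-results --profile bonsai-prod --query-id "$query_id"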

CloudWatch Metrics #

Monitor AWS infrastructure:

  1. EKS Cluster Metrics

    • CPU utilization
    • Memory usage
    • Pod count
    • Network I/O
  2. RDS/Database

    • Connection count
    • CPU and memory
    • Disk I/O
    • Replication lag (if applicable)
  3. ElastiCache (Redis)

    • Cache hit rate
    • Memory usage
    • Evicted keys
  4. Amazon MQ (RabbitMQ)

    • Queue depth
    • Message rate
    • Connection count
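
Any of these metrics can also be pulled from a terminal with get-metric-statistics when the console is unavailable. A sketch for the RDS connection count; the DB instance identifier is a placeholder and the date invocation is GNU syntax:

# Average and max RDS connections over the last hour (instance identifier is a placeholder)
aws cloudwatch get-metric-statistics \
  --profile bonsai-prod \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=<db-instance-id> \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Average Maximum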

Alert Interpretation #

Common Alert Types #

High Error Rate #

Alert: API Error Rate > 5%
Service: bonsapi
Current: 12%

Investigation steps:

  1. Check recent deployments (last 1-2 hours)
  2. Review error logs in Datadog
  3. Check affected endpoints
  4. Verify database connectivity
  5. Check external service status (Clerk, Stripe, etc.)
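
Step 2 can be done from a terminal with the Datadog Logs Search API; the returned events include the endpoint and status code needed for step 3. A sketch, assuming API and application keys are exported:

# Recent bonsapi 5xx responses from the last hour
curl --silent -X POST "https://api.us3.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: ${DATADOG_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DATADOG_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"filter": {"query": "service:bonsapi @http.status_code:>=500", "from": "now-1h", "to": "now"}, "page": {"limit": 25}, "sort": "-timestamp"}' \
  | jq '.data[].attributes | {timestamp, message}'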

High Latency #

Alert: API Response Time > 2s
Service: bonsapi
P95: 3.2s

Investigation steps:

  1. Check database query performance
  2. Review RabbitMQ queue depth
  3. Check Redis cache hit rate
  4. Analyze slow traces in Datadog APM
  5. Check for N+1 queries or missing indexes
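
For step 1, if the pg_stat_statements extension is enabled on the PostgreSQL instance (an assumption), the slowest statements can be listed directly. The column names below match PostgreSQL 13 and later; DATABASE_URL is a placeholder:

# Top 10 statements by mean execution time (requires pg_stat_statements)
psql "$DATABASE_URL" -c "
  SELECT left(query, 80) AS query,
         calls,
         round(mean_exec_time::numeric, 1) AS mean_ms,
         round(total_exec_time::numeric, 1) AS total_ms
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10;"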

Pod Restart Loop #

Alert: Pod Restarting Frequently
Pod: bonsapi-deployment-xyz123
Restarts: 5 in 10 minutes

Investigation steps:

  1. Check pod logs: kubectl logs <pod-name> -n default --previous
  2. Review resource limits: kubectl describe pod <pod-name>
  3. Check for OOM kills: Look for OOMKilled status
  4. Verify database connectivity
  5. Check secret availability
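
A quick way to confirm step 3 without scrolling through describe output; the pod name below is the placeholder from the alert example:

# Last termination reason per container (shows "OOMKilled" if the memory limit was hit)
kubectl get pod bonsapi-deployment-xyz123 -n default \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# Recent events for the pod, newest last
kubectl get events -n default \
  --field-selector involvedObject.name=bonsapi-deployment-xyz123 \
  --sort-by=.lastTimestamp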

Queue Backlog #

Alert: RabbitMQ Queue Depth > 1000
Queue: invoice.processing
Depth: 2,500 messages

Investigation steps:

  1. Check consumer pod status
  2. Review consumer logs for errors
  3. Check processing rate vs arrival rate
  4. Scale consumers if needed: See RabbitMQ Management
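
For steps 1 and 3, the broker's management API reports depth and rates for a single queue, and consumers can be scaled with kubectl. A sketch; the broker host, credentials, vhost, and deployment name are placeholders:

# Queue depth and message rates from the RabbitMQ management API (%2F is the default vhost)
curl --silent -u "$RABBITMQ_USER:$RABBITMQ_PASS" \
  "https://<broker-host>/api/queues/%2F/invoice.processing" \
  | jq '{messages, messages_ready, consumers, publish_rate: .message_stats.publish_details.rate, ack_rate: .message_stats.ack_details.rate}'

# Scale the consumer deployment if arrival rate exceeds processing rate (deployment name is a placeholder)
kubectl scale deployment bonsai-invoice -n default --replicas=4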

Setting Up Monitoring for New Services #

When adding a new service:

  1. Add Datadog logging

    # In deployment.yaml
    env:
      - name: DATADOG_API_KEY
        valueFrom:
          secretKeyRef:
            name: bonsai-secret
            key: DATADOG_API_KEY
    
  2. Tag metrics properly

    • Service name
    • Environment (dev/prod)
    • Version/release tag
  3. Create dashboard

    • Clone existing service dashboard
    • Update service name filter
    • Add service-specific metrics
  4. Configure alerts

    • Error rate threshold
    • Latency threshold
    • Resource usage
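
Alerts can also be created through the Datadog Monitors API instead of cloning in the UI. A minimal sketch for an error-rate monitor; the metric names, threshold, notification handle, and tags are placeholders to adapt per service:

# Create a metric monitor for the new service (names and thresholds are placeholders)
curl --silent -X POST "https://api.us3.datadoghq.com/api/v1/monitor" \
  -H "DD-API-KEY: ${DATADOG_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DATADOG_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "[new-service] error rate above 5%",
    "type": "query alert",
    "query": "sum(last_5m):sum:trace.http.request.errors{service:new-service}.as_count() / sum:trace.http.request.hits{service:new-service}.as_count() > 0.05",
    "message": "Error rate above 5% for new-service. Runbook: <link> @pagerduty-<service>",
    "tags": ["service:new-service", "env:prod"],
    "options": {"thresholds": {"critical": 0.05}, "notify_no_data": false}
  }'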

Troubleshooting Monitoring Issues #

Missing Metrics #

Problem: Metrics not appearing in Datadog

Solutions:

  1. Verify DATADOG_API_KEY is set correctly:

    kubectl get secret bonsai-secret -o jsonpath='{.data.DATADOG_API_KEY}' | base64 -d
    
  2. Check OpenTelemetry collector status:

    kubectl get pods -n aws-observability
    kubectl logs -n aws-observability -l app.kubernetes.io/name=otel-collector
    
  3. Verify network connectivity from pods to Datadog endpoint
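
Step 3 is easiest from inside the cluster, so the test shares the collector's network path. A sketch using a throwaway curl pod:

# Check HTTPS reachability of the Datadog US3 endpoint from inside the cluster
# (any HTTP status code proves connectivity; a DNS error or timeout does not)
kubectl run dd-connectivity-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sS -o /dev/null -w "%{http_code}\n" https://api.us3.datadoghq.com/api/v1/validate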

CloudWatch Logs Not Appearing #

Problem: Pod logs not showing in CloudWatch

Solutions:

  1. Check Fluent Bit configuration:

    kubectl get configmap aws-logging -n aws-observability -o yaml
    
  2. Verify IAM permissions for log publishing

  3. Check Fluent Bit pods:

    kubectl get pods -n aws-observability
    

High Datadog Costs #

Problem: Excessive log/metric volume

Solutions:

  1. Review log sampling rates
  2. Filter noisy logs at source
  3. Adjust metric collection intervals
  4. Use log exclusion filters for debug logs in production
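
Before adding new exclusion filters, the current indexes and their filters can be inspected through the Logs Indexes API. A sketch:

# List log indexes with their daily quotas and exclusion filters
curl --silent "https://api.us3.datadoghq.com/api/v1/logs/config/indexes" \
  -H "DD-API-KEY: ${DATADOG_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DATADOG_APP_KEY}" \
  | jq '.indexes[] | {name, daily_limit, exclusion_filters: [.exclusion_filters[].name]}'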

Best Practices #

Effective Alerting #

  • Alert on symptoms, not causes - Alert when users are impacted
  • Set appropriate thresholds - Avoid alert fatigue
  • Include runbook links - Help responders quickly take action
  • Review and tune - Regularly review false positives

Dashboard Design #

  • Start with key metrics - RED method (Rate, Errors, Duration)
  • Add context - Related metrics, deployment markers
  • Use templates - Consistent across services
  • Keep it simple - Avoid information overload

Log Management #

  • Structure your logs - Use JSON for easier parsing
  • Include context - Request IDs, user IDs, trace IDs
  • Control volume - Use appropriate log levels
  • Sensitive data - Never log passwords, tokens, PII
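
As a reference point, a structured log line following these guidelines might look like the output of this sketch; the field names are illustrative, not a schema the services enforce:

# Example of a structured, context-rich log entry (field names are illustrative)
jq -cn '{
  timestamp: (now | floor | todate),
  level: "error",
  service: "bonsapi",
  message: "invoice processing failed",
  request_id: "<request-id>",
  trace_id: "<trace-id>",
  user_id: "<user-id>"
}'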

See Also #