Service Health Checks #

This runbook covers how to verify the health of all BonsAI services and their dependencies.

When to Use This Runbook #

  • Performing routine health checks
  • Verifying system status after deployments
  • Investigating potential issues proactively
  • Capacity planning and resource monitoring
  • Pre-deployment verification

Service Overview #

BonsAI consists of multiple microservices:

| Service | Type | Purpose | Health Endpoint |
|---|---|---|---|
| BonsAPI | Backend API | Core REST API | `/health` |
| Webapp | Frontend | Next.js application | `/api/health` |
| bonsai-invoice | Worker | Invoice processing | (via pod status) |
| bonsai-knowledge | Worker | Knowledge extraction | (via pod status) |
| bonsai-doc-convert | Worker | Document conversion | (via pod status) |
| bonsai-notification | Worker | Notifications | (via pod status) |
| bonsai-trigger-sync | Worker | Trigger processing | (via pod status) |
| bonsai-accounting-sync | Worker | Accounting sync | (via pod status) |

Quick Health Check Script #

#!/bin/bash
# Comprehensive health check for BonsAI platform

echo "=== BonsAI Health Check ==="
echo "Time: $(date)"
echo ""

# 1. Check Kubernetes cluster
echo "📋 Kubernetes Cluster Status"
kubectl cluster-info
echo ""

# 2. Check all pods
echo "🔍 Pod Status"
kubectl get pods -o wide --no-headers | grep -v "Running\|Completed" || echo "✅ All pods are running"
echo ""

# 3. Check deployments
echo "📦 Deployment Status"
kubectl get deployments
echo ""

# 4. Check services
echo "🌐 Service Status"
kubectl get services
echo ""

# 5. Check resource usage
echo "💾 Resource Usage"
kubectl top nodes
echo ""
kubectl top pods --sort-by=memory | head -10
echo ""

# 6. Test API health endpoints
echo "🏥 API Health Checks"
echo "Production API:"
curl -sf https://api.gotofu.com/health || echo "❌ API health check failed"
echo ""
echo "Production Webapp:"
curl -sf -o /dev/null -w '%{http_code}\n' https://app.gotofu.com/ || echo "❌ Webapp health check failed"
echo ""

# 7. Check recent events
echo "📅 Recent Events (last 10)"
kubectl get events --sort-by='.lastTimestamp' | head -10
echo ""

echo "=== Health Check Complete ==="

Individual Service Health Checks #

BonsAPI (Backend) #

Check pod status:

# List BonsAPI pods
kubectl get pods -l app=bonsapi

# Expected output:
NAME                       READY   STATUS    RESTARTS   AGE
bonsapi-abc123            1/1     Running   0          5d
bonsapi-def456            1/1     Running   0          5d

Check health endpoint:

# Production
curl https://api.gotofu.com/health

# Expected response:
{"status":"healthy","timestamp":"2025-10-21T10:15:30Z"}

# Via port-forward (for detailed checks; run in a separate terminal or background it)
kubectl port-forward service/bonsapi-service 8080:8080 &
sleep 2  # give the forward a moment to establish
curl http://localhost:8080/health

Check resource usage:

# CPU and memory
kubectl top pod -l app=bonsapi

# Detailed pod information
kubectl describe pod -l app=bonsapi | grep -A 5 "Limits\|Requests"

Check logs for errors:

kubectl logs -l app=bonsapi --tail=50 | grep -i error

Health indicators:

  • ✅ All pods Running (READY 1/1)
  • ✅ Health endpoint returns 200
  • ✅ No recent errors in logs
  • ✅ CPU usage < 80% of limit
  • ✅ Memory usage < 80% of limit
  • ✅ Restart count = 0 or low
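The restart-count indicator can be checked directly. A sketch that flags any `app=bonsapi` pod restarted more than 3 times (the threshold here is an arbitrary example, not an official SLO):

```shell
# Flag bonsapi pods whose restart count exceeds a threshold (3 is arbitrary)
kubectl get pods -l app=bonsapi \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | awk -F'\t' '$2 > 3 {print "⚠️ " $1 " restarted " $2 " times"}'
```

No output means all pods are within the threshold.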

Webapp (Frontend) #

Check pod status:

kubectl get pods -l app=webapp

Check health:

# Production
curl -I https://app.gotofu.com/

# Should return: HTTP/2 200

Check logs:

kubectl logs -l app=webapp --tail=50

Health indicators:

  • ✅ All pods Running
  • ✅ HTTP 200 from homepage
  • ✅ No build errors in logs
  • ✅ Resource usage normal

Worker Services #

Worker services don’t have HTTP endpoints. Check via pod status and logs.

Invoice Processor:

# Pod status
kubectl get pods -l app=bonsai-invoice

# Check if processing messages
kubectl logs -l app=bonsai-invoice --tail=20

# Should see: "Processing message..." or similar

Knowledge Service:

kubectl get pods -l app=bonsai-knowledge
kubectl logs -l app=bonsai-knowledge --tail=20

Document Converter:

kubectl get pods -l app=bonsai-doc-convert
kubectl logs -l app=bonsai-doc-convert --tail=20

Health indicators for workers:

  • ✅ At least 1 pod Running
  • ✅ Logs show active message processing
  • ✅ No connection errors in logs
  • ✅ KEDA scaling working (check ScaledObject)
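A quick way to apply these indicators across all six workers is to loop over the labels from the service table and count error-ish lines in recent logs (the `app=` labels are assumed to match the table above):

```shell
# Count error-ish lines in each worker's recent logs; non-zero counts warrant a closer look
for app in bonsai-invoice bonsai-knowledge bonsai-doc-convert \
           bonsai-notification bonsai-trigger-sync bonsai-accounting-sync; do
  count=$(kubectl logs -l app="$app" --tail=50 2>/dev/null \
    | grep -ci 'error\|connection refused' || true)
  echo "$app: $count error lines in last 50"
done
```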

Dependency Health Checks #

Database (PostgreSQL/RDS) #

Check connection from API pod:

kubectl exec -it <bonsapi-pod> -- nc -zv <db-host> 5432

Via AWS Console:

# Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier bonsai-prod-db \
  --query 'DBInstances[0].DBInstanceStatus'

# Should return: "available"

# Check connection count ('date -d' below is GNU date; on macOS use 'date -v-1H')
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=bonsai-prod-db \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average

Health indicators:

  • ✅ Instance status: available
  • ✅ Connections < max_connections
  • ✅ CPU usage < 80%
  • ✅ Free memory > 20%
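The "Connections < max_connections" indicator can be turned into a pass/fail check. A sketch; the threshold of 150 is a placeholder and should be derived from the instance's actual `max_connections` parameter:

```shell
# Compare recent connection count against a placeholder threshold
# (uses GNU date; on macOS substitute 'date -v-5M')
CONNS=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=bonsai-prod-db \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Maximum \
  --query 'Datapoints[0].Maximum' --output text || true)
awk -v c="$CONNS" -v t=150 'BEGIN {
  if (c + 0 < t) print "✅ connections OK (" c ")"
  else           print "⚠️ connections high (" c ")"
}'
```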

Redis (ElastiCache) #

Check connection from API pod:

kubectl exec -it <bonsapi-pod> -- nc -zv <redis-host> 6379

Via AWS Console:

# Check cluster status
aws elasticache describe-cache-clusters \
  --cache-cluster-id bonsai-redis-prod \
  --query 'CacheClusters[0].CacheClusterStatus'

Health indicators:

  • ✅ Cluster status: available
  • ✅ Memory usage < 80%
  • ✅ CPU usage < 80%
  • ✅ Evictions = 0 or low

RabbitMQ (Amazon MQ) #

Check broker status:

# Via AWS CLI
aws mq describe-broker --broker-id <broker-id>

# Check management console
# https://<rabbitmq-host>:15671

Check queue health:

See RabbitMQ Management runbook for detailed checks.

Health indicators:

  • ✅ Broker status: RUNNING
  • ✅ Queue depths < threshold
  • ✅ Consumers connected
  • ✅ Message rates normal

S3 Storage #

Test S3 access from API pod:

kubectl exec -it <bonsapi-pod> -- sh -c "
  aws s3 ls s3://bonsai-documents-prod/ --max-items 1
"

Check bucket:

# Via AWS CLI
aws s3 ls s3://bonsai-documents-prod/

# Check bucket size (a full recursive listing can be slow on large buckets)
aws s3 ls s3://bonsai-documents-prod/ --recursive --summarize

Health indicators:

  • ✅ Bucket accessible
  • ✅ Storage usage within limits
  • ✅ No 403/404 errors

System-Wide Health Checks #

Kubernetes Cluster #

Cluster info:

# Basic cluster health
kubectl cluster-info

# Node status
kubectl get nodes -o wide

# Component status (deprecated since Kubernetes 1.19; may report errors on managed clusters)
kubectl get componentstatuses

Health indicators:

  • ✅ All nodes Ready
  • ✅ Control plane components healthy
  • ✅ API server responsive

Node Health #

Check node resources:

# Node resource usage
kubectl top nodes

# Detailed node info
kubectl describe nodes

# Check for pressure conditions
kubectl describe nodes | grep -A 5 "Conditions:"

Health indicators:

  • ✅ CPU usage < 80%
  • ✅ Memory usage < 80%
  • ✅ Disk usage < 85%
  • ✅ No pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
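The pressure conditions can also be surfaced in one line: a healthy node reports only `Ready` as True, so any other condition in the output deserves attention. A sketch:

```shell
# Print each node with every condition currently True; flag anything beyond plain "Ready"
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .status.conditions[?(@.status=="True")]}{.type}{" "}{end}{"\n"}{end}' \
  | sed 's/ *$//' | grep -vE ': Ready$' || echo "✅ no pressure conditions"
```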

Ingress/Load Balancer #

Check ingress status:

# List ingresses
kubectl get ingress

# Describe ingress
kubectl describe ingress bonsapi-ingress

# Check ALB health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>

Health indicators:

  • ✅ Ingress has address/hostname
  • ✅ All targets healthy in ALB
  • ✅ SSL certificates valid
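The certificate-validity indicator can be spot-checked from any workstation with `openssl`; the hostnames below are the public endpoints used earlier in this runbook:

```shell
# Print the expiry date of each public endpoint's TLS certificate
for host in api.gotofu.com app.gotofu.com; do
  printf '%s: ' "$host"
  echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate || echo "(could not retrieve certificate)"
done
```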

Monitoring Dashboards #

Datadog Dashboard #

Key metrics to monitor:

  • API Response Time - P50, P95, P99 latency
  • Error Rate - 4xx and 5xx errors
  • Request Volume - Requests per second
  • Pod Health - Running pods vs desired
  • Resource Usage - CPU, memory, disk
  • Queue Depth - RabbitMQ queue lengths
  • Database Performance - Query time, connections

CloudWatch Dashboard #

Access via AWS Console → CloudWatch → Dashboards

Key widgets:

  • EKS cluster metrics
  • RDS performance
  • ElastiCache metrics
  • ALB metrics
  • Log insights queries

Automated Health Checks #

Kubernetes Liveness Probes #

BonsAI services have liveness probes configured:

# Example from BonsAPI deployment
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Check probe configuration:

kubectl describe pod <pod-name> | grep -A 10 "Liveness\|Readiness"

Kubernetes Readiness Probes #

Readiness probes determine whether a pod should receive traffic:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

If a pod is not ready:

# Check why pod is not ready
kubectl describe pod <pod-name>

# Check logs
kubectl logs <pod-name>

Health Check Schedule #

Daily (Automated) #

  • Monitoring alerts
  • Kubernetes health checks
  • Datadog dashboards

Weekly (Manual) #

  • Review error rates and trends
  • Check resource usage trends
  • Review scaling patterns
  • Check backup status

Monthly (Manual) #

  • Capacity planning review
  • Cost optimization review
  • Security updates check
  • Performance optimization review

Troubleshooting Unhealthy Services #

Pod Not Ready #

Investigation:

# Why is pod not ready?
kubectl describe pod <pod-name>

# Check logs
kubectl logs <pod-name>

# Check events
kubectl get events --field-selector involvedObject.name=<pod-name>

Common causes:

  • Failing readiness probe
  • Slow startup time
  • Database connection issues
  • Missing dependencies
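For the causes above, the container's waiting reason (e.g. `CrashLoopBackOff`, `ImagePullBackOff`) narrows things down quickly. A sketch:

```shell
# Show each container's waiting reason; containers with no reason are running
kubectl get pod <pod-name> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.waiting.reason}{"\n"}{end}' \
  | grep -v ': $' || echo "✅ no containers stuck in waiting"
```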

High Resource Usage #

Investigation:

# Check resource usage
kubectl top pod <pod-name>

# Check limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"

# Check for throttling
kubectl describe pod <pod-name> | grep -i throttl

Solutions:

  • Increase resource limits
  • Optimize code
  • Scale horizontally

Service Degradation #

Investigation:

  1. Check metrics in Datadog
  2. Review error logs
  3. Check dependency health
  4. Review recent changes

See: Incident Response for detailed procedures

Pre-Deployment Health Check #

Before deploying to production:

# 1. Verify current system health
./health-check.sh

# 2. Check resource capacity
kubectl top nodes
kubectl top pods

# 3. Verify database status
aws rds describe-db-instances --db-instance-identifier bonsai-prod-db

# 4. Check queue depths
# Via RabbitMQ management console

# 5. Review recent errors
# Via Datadog or CloudWatch

# 6. Verify backups are recent
aws rds describe-db-snapshots --db-instance-identifier bonsai-prod-db | head -20

See Also #