Service Health Checks #

This runbook covers how to verify the health of all BonsAI services and their dependencies.

When to Use This Runbook #

  • Performing routine health checks
  • Verifying system status after deployments
  • Investigating potential issues proactively
  • Capacity planning and resource monitoring
  • Pre-deployment verification

Service Overview #

BonsAI consists of multiple microservices:

| Service | Type | Purpose | Health Endpoint |
|---|---|---|---|
| BonsAPI | Backend API | Core REST API | `/health` |
| Webapp | Frontend | Next.js application | `/api/health` |
| bonsai-invoice | Worker | Invoice processing | (via pod status) |
| bonsai-knowledge | Worker | Knowledge extraction | (via pod status) |
| bonsai-doc-convert | Worker | Document conversion | (via pod status) |
| bonsai-notification | Worker | Notifications | (via pod status) |
| bonsai-trigger-sync | Worker | Trigger processing | (via pod status) |
| bonsai-accounting-sync | Worker | Accounting sync | (via pod status) |

Quick Health Check Script #

#!/bin/bash
# Comprehensive health check for BonsAI platform

echo "=== BonsAI Health Check ==="
echo "Time: $(date)"
echo ""

# 1. Check Kubernetes cluster
echo "📋 Kubernetes Cluster Status"
kubectl cluster-info
echo ""

# 2. Check all pods
echo "🔍 Pod Status"
kubectl get pods -o wide --no-headers | grep -v "Running\|Completed" || echo "✅ All pods are running"
echo ""

# 3. Check deployments
echo "📦 Deployment Status"
kubectl get deployments
echo ""

# 4. Check services
echo "🌐 Service Status"
kubectl get services
echo ""

# 5. Check resource usage
echo "💾 Resource Usage"
kubectl top nodes
echo ""
kubectl top pods --sort-by=memory | head -10
echo ""

# 6. Test API health endpoints
echo "🏥 API Health Checks"
echo "Production API:"
curl -sf https://api.gotofu.com/health || echo "❌ API health check failed"
echo ""
echo "Production Webapp:"
curl -sf -o /dev/null -w '%{http_code}\n' https://app.gotofu.com/ || echo "❌ Webapp health check failed"
echo ""

# 7. Check recent events
echo "📅 Recent Events (last 10)"
kubectl get events --sort-by='.lastTimestamp' | head -10
echo ""

echo "=== Health Check Complete ==="

Individual Service Health Checks #

BonsAPI (Backend) #

Check pod status:

# List BonsAPI pods
kubectl get pods -l app=bonsapi

# Expected output:
NAME                       READY   STATUS    RESTARTS   AGE
bonsapi-abc123            1/1     Running   0          5d
bonsapi-def456            1/1     Running   0          5d

Check health endpoint:

# Production
curl https://api.gotofu.com/health

# Expected response:
{"status":"healthy","timestamp":"2025-10-21T10:15:30Z"}

# Via port-forward (for detailed checks; run in a separate terminal or background it)
kubectl port-forward service/bonsapi-service 8080:8080 &
sleep 2  # give the forward a moment to establish
curl http://localhost:8080/health

Check resource usage:

# CPU and memory
kubectl top pod -l app=bonsapi

# Detailed pod information
kubectl describe pod -l app=bonsapi | grep -A 5 "Limits\|Requests"

Check logs for errors:

kubectl logs -l app=bonsapi --tail=50 | grep -i error

Health indicators:

  • ✅ All pods Running (READY 1/1)
  • ✅ Health endpoint returns 200
  • ✅ No recent errors in logs
  • ✅ CPU usage < 80% of limit
  • ✅ Memory usage < 80% of limit
  • ✅ Restart count = 0 or low
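The restart-count indicator can be checked directly. A sketch that flags any `app=bonsapi` pod restarted more than 3 times (the threshold here is an arbitrary example, not an official SLO):

```shell
# Flag bonsapi pods whose restart count exceeds a threshold (3 is arbitrary)
kubectl get pods -l app=bonsapi \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
  | awk -F'\t' '$2 > 3 {print "⚠️ " $1 " restarted " $2 " times"}'
```

No output means all pods are within the threshold.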

Webapp (Frontend) #

Check pod status:

kubectl get pods -l app=webapp

Check health:

# Production
curl -I https://app.gotofu.com/

# Should return: HTTP/2 200

Check logs:

kubectl logs -l app=webapp --tail=50

Health indicators:

  • ✅ All pods Running
  • ✅ HTTP 200 from homepage
  • ✅ No build errors in logs
  • ✅ Resource usage normal

Worker Services #

Worker services don’t have HTTP endpoints. Check via pod status and logs.

Invoice Processor:

# Pod status
kubectl get pods -l app=bonsai-invoice

# Check if processing messages
kubectl logs -l app=bonsai-invoice --tail=20

# Should see: "Processing message..." or similar

Knowledge Service:

kubectl get pods -l app=bonsai-knowledge
kubectl logs -l app=bonsai-knowledge --tail=20

Document Converter:

kubectl get pods -l app=bonsai-doc-convert
kubectl logs -l app=bonsai-doc-convert --tail=20

Health indicators for workers:

  • ✅ At least 1 pod Running
  • ✅ Logs show active message processing
  • ✅ No connection errors in logs
  • ✅ KEDA scaling working (check ScaledObject)
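A quick way to apply these indicators across all six workers is to loop over the labels from the service table and count error-ish lines in recent logs (the `app=` labels are assumed to match the table above):

```shell
# Count error-ish lines in each worker's recent logs; non-zero counts warrant a closer look
for app in bonsai-invoice bonsai-knowledge bonsai-doc-convert \
           bonsai-notification bonsai-trigger-sync bonsai-accounting-sync; do
  count=$(kubectl logs -l app="$app" --tail=50 2>/dev/null \
    | grep -ci 'error\|connection refused' || true)
  echo "$app: $count error lines in last 50"
done
```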

Dependency Health Checks #

Database (PostgreSQL/RDS) #

Check connection from API pod:

kubectl exec -it <bonsapi-pod> -- nc -zv <db-host> 5432

Via AWS Console:

# Check RDS instance status
aws rds describe-db-instances \
  --db-instance-identifier bonsai-prod-db \
  --query 'DBInstances[0].DBInstanceStatus'

# Should return: "available"

# Check connection count ('date -d' below is GNU date; on macOS use 'date -v-1H')
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=bonsai-prod-db \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average

Health indicators:

  • ✅ Instance status: available
  • ✅ Connections < max_connections
  • ✅ CPU usage < 80%
  • ✅ Free memory > 20%
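The "Connections < max_connections" indicator can be turned into a pass/fail check. A sketch; the threshold of 150 is a placeholder and should be derived from the instance's actual `max_connections` parameter:

```shell
# Compare recent connection count against a placeholder threshold
# (uses GNU date; on macOS substitute 'date -v-5M')
CONNS=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=bonsai-prod-db \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Maximum \
  --query 'Datapoints[0].Maximum' --output text || true)
awk -v c="$CONNS" -v t=150 'BEGIN {
  if (c + 0 < t) print "✅ connections OK (" c ")"
  else           print "⚠️ connections high (" c ")"
}'
```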

Redis (ElastiCache) #

Check connection from API pod:

kubectl exec -it <bonsapi-pod> -- nc -zv <redis-host> 6379

Via AWS Console:

# Check cluster status
aws elasticache describe-cache-clusters \
  --cache-cluster-id bonsai-redis-prod \
  --query 'CacheClusters[0].CacheClusterStatus'

Health indicators:

  • ✅ Cluster status: available
  • ✅ Memory usage < 80%
  • ✅ CPU usage < 80%
  • ✅ Evictions = 0 or low

RabbitMQ (Amazon MQ) #

Check broker status:

# Via AWS CLI
aws mq describe-broker --broker-id <broker-id>

# Check management console
# https://<rabbitmq-host>:15671

Check queue health:

See RabbitMQ Management runbook for detailed checks.

Health indicators:

  • ✅ Broker status: RUNNING
  • ✅ Queue depths < threshold
  • ✅ Consumers connected
  • ✅ Message rates normal

S3 Storage #

Test S3 access from API pod:

kubectl exec -it <bonsapi-pod> -- sh -c "
  aws s3 ls s3://bonsai-documents-prod/ --max-items 1
"

Check bucket:

# Via AWS CLI
aws s3 ls s3://bonsai-documents-prod/

# Check bucket size (a full recursive listing can be slow on large buckets)
aws s3 ls s3://bonsai-documents-prod/ --recursive --summarize

Health indicators:

  • ✅ Bucket accessible
  • ✅ Storage usage within limits
  • ✅ No 403/404 errors

System-Wide Health Checks #

Kubernetes Cluster #

Cluster info:

# Basic cluster health
kubectl cluster-info

# Node status
kubectl get nodes -o wide

# Component status (deprecated since Kubernetes 1.19; may report errors on managed clusters)
kubectl get componentstatuses

Health indicators:

  • ✅ All nodes Ready
  • ✅ Control plane components healthy
  • ✅ API server responsive

Node Health #

Check node resources:

# Node resource usage
kubectl top nodes

# Detailed node info
kubectl describe nodes

# Check for pressure conditions
kubectl describe nodes | grep -A 5 "Conditions:"

Health indicators:

  • ✅ CPU usage < 80%
  • ✅ Memory usage < 80%
  • ✅ Disk usage < 85%
  • ✅ No pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
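The pressure conditions can also be surfaced in one line: a healthy node reports only `Ready` as True, so any other condition in the output deserves attention. A sketch:

```shell
# Print each node with every condition currently True; flag anything beyond plain "Ready"
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .status.conditions[?(@.status=="True")]}{.type}{" "}{end}{"\n"}{end}' \
  | sed 's/ *$//' | grep -vE ': Ready$' || echo "✅ no pressure conditions"
```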

Ingress/Load Balancer #

Check ingress status:

# List ingresses
kubectl get ingress

# Describe ingress
kubectl describe ingress bonsapi-ingress

# Check ALB health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>

Health indicators:

  • ✅ Ingress has address/hostname
  • ✅ All targets healthy in ALB
  • ✅ SSL certificates valid
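The certificate-validity indicator can be spot-checked from any workstation with `openssl`; the hostnames below are the public endpoints used earlier in this runbook:

```shell
# Print the expiry date of each public endpoint's TLS certificate
for host in api.gotofu.com app.gotofu.com; do
  printf '%s: ' "$host"
  echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate || echo "(could not retrieve certificate)"
done
```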

Monitoring Dashboards #

Datadog Dashboard #

Key metrics to monitor:

  • API Response Time - P50, P95, P99 latency
  • Error Rate - 4xx and 5xx errors
  • Request Volume - Requests per second
  • Pod Health - Running pods vs desired
  • Resource Usage - CPU, memory, disk
  • Queue Depth - RabbitMQ queue lengths
  • Database Performance - Query time, connections

CloudWatch Dashboard #

Access via AWS Console → CloudWatch → Dashboards

Key widgets:

  • EKS cluster metrics
  • RDS performance
  • ElastiCache metrics
  • ALB metrics
  • Log insights queries

Automated Health Checks #

Kubernetes Liveness Probes #

BonsAI services have liveness probes configured:

# Example from BonsAPI deployment
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

Check probe configuration:

kubectl describe pod <pod-name> | grep -A 10 "Liveness\|Readiness"

Kubernetes Readiness Probes #

Readiness probes determine whether a pod should receive traffic:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

If a pod is not ready:

# Check why pod is not ready
kubectl describe pod <pod-name>

# Check logs
kubectl logs <pod-name>

Health Check Schedule #

Daily (Automated) #

  • Monitoring alerts
  • Kubernetes health checks
  • Datadog dashboards

Weekly (Manual) #

  • Review error rates and trends
  • Check resource usage trends
  • Review scaling patterns
  • Check backup status

Monthly (Manual) #

  • Capacity planning review
  • Cost optimization review
  • Security updates check
  • Performance optimization review

Troubleshooting Unhealthy Services #

Pod Not Ready #

Investigation:

# Why is pod not ready?
kubectl describe pod <pod-name>

# Check logs
kubectl logs <pod-name>

# Check events
kubectl get events --field-selector involvedObject.name=<pod-name>

Common causes:

  • Failing readiness probe
  • Slow startup time
  • Database connection issues
  • Missing dependencies
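For the causes above, the container's waiting reason (e.g. `CrashLoopBackOff`, `ImagePullBackOff`) narrows things down quickly. A sketch:

```shell
# Show each container's waiting reason; containers with no reason are running
kubectl get pod <pod-name> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.waiting.reason}{"\n"}{end}' \
  | grep -v ': $' || echo "✅ no containers stuck in waiting"
```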

High Resource Usage #

Investigation:

# Check resource usage
kubectl top pod <pod-name>

# Check limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"

# Check for throttling
kubectl describe pod <pod-name> | grep -i throttl

Solutions:

  • Increase resource limits
  • Optimize code
  • Scale horizontally

Service Degradation #

Investigation:

  1. Check metrics in Datadog
  2. Review error logs
  3. Check dependency health
  4. Review recent changes

See: Incident Response for detailed procedures

Pre-Deployment Health Check #

Before deploying to production:

# 1. Verify current system health
./health-check.sh

# 2. Check resource capacity
kubectl top nodes
kubectl top pods

# 3. Verify database status
aws rds describe-db-instances --db-instance-identifier bonsai-prod-db

# 4. Check queue depths
# Via RabbitMQ management console

# 5. Review recent errors
# Via Datadog or CloudWatch

# 6. Verify backups are recent
aws rds describe-db-snapshots --db-instance-identifier bonsai-prod-db | head -20

See Also #