Service Health Checks #
This runbook covers how to verify the health of all BonsAI services and their dependencies.
When to Use This Runbook #
- Performing routine health checks
- Verifying system status after deployments
- Investigating potential issues proactively
- Capacity planning and resource monitoring
- Pre-deployment verification
Service Overview #
BonsAI consists of multiple microservices:
| Service | Type | Purpose | Health Endpoint |
|---|---|---|---|
| BonsAPI | Backend API | Core REST API | /health |
| Webapp | Frontend | Next.js application | /api/health |
| bonsai-invoice | Worker | Invoice processing | (via pod status) |
| bonsai-knowledge | Worker | Knowledge extraction | (via pod status) |
| bonsai-doc-convert | Worker | Document conversion | (via pod status) |
| bonsai-notification | Worker | Notifications | (via pod status) |
| bonsai-trigger-sync | Worker | Trigger processing | (via pod status) |
| bonsai-accounting-sync | Worker | Accounting sync | (via pod status) |
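For a quick pass over the worker services (which expose no HTTP health endpoint), you can loop over their deployments. A minimal sketch, assuming the deployment names match the service names in the table above:
# Worker deployment rollout status (names assumed to match the table; adjust if they differ)
for svc in bonsai-invoice bonsai-knowledge bonsai-doc-convert bonsai-notification bonsai-trigger-sync bonsai-accounting-sync; do
  echo "--- $svc ---"
  kubectl rollout status deployment/"$svc" --timeout=30s
done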
Quick Health Check Script #
#!/bin/bash
# Comprehensive health check for BonsAI platform
echo "=== BonsAI Health Check ==="
echo "Time: $(date)"
echo ""
# 1. Check Kubernetes cluster
echo "📋 Kubernetes Cluster Status"
kubectl cluster-info
echo ""
# 2. Check all pods
echo "🔍 Pod Status"
kubectl get pods -o wide --no-headers | grep -Ev "Running|Completed" || echo "✅ All pods are running"
echo ""
# 3. Check deployments
echo "📦 Deployment Status"
kubectl get deployments
echo ""
# 4. Check services
echo "🌐 Service Status"
kubectl get services
echo ""
# 5. Check resource usage
echo "💾 Resource Usage"
kubectl top nodes
echo ""
kubectl top pods --sort-by=memory | head -10
echo ""
# 6. Test API health endpoints
echo "🏥 API Health Checks"
echo "Production API:"
curl -sf https://api.gotofu.com/health || echo "❌ API health check failed"
echo ""
echo "Production Webapp:"
curl -sfI https://app.gotofu.com/ -o /dev/null && echo "✅ Webapp responding" || echo "❌ Webapp health check failed"
echo ""
# 7. Check recent events
echo "📅 Recent Events (last 10)"
kubectl get events --sort-by='.lastTimestamp' | tail -10
echo ""
echo "=== Health Check Complete ==="
Individual Service Health Checks #
BonsAPI (Backend) #
Check pod status:
# List BonsAPI pods
kubectl get pods -l app=bonsapi
# Expected output:
NAME READY STATUS RESTARTS AGE
bonsapi-abc123 1/1 Running 0 5d
bonsapi-def456 1/1 Running 0 5d
Check health endpoint:
# Production
curl https://api.gotofu.com/health
# Expected response:
{"status":"healthy","timestamp":"2025-10-21T10:15:30Z"}
# Via port-forward (for detailed checks)
kubectl port-forward service/bonsapi-service 8080:8080
curl http://localhost:8080/health
Check resource usage:
# CPU and memory
kubectl top pod -l app=bonsapi
# Detailed pod information
kubectl describe pod -l app=bonsapi | grep -A 5 "Limits\|Requests"
Check logs for errors:
kubectl logs -l app=bonsapi --tail=50 | grep -i error
Health indicators:
- ✅ All pods Running (READY 1/1)
- ✅ Health endpoint returns 200
- ✅ No recent errors in logs
- ✅ CPU usage < 80% of limit
- ✅ Memory usage < 80% of limit
- ✅ Restart count = 0 or low
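To spot-check the HTTP status and restart-count indicators above from the command line (same label selector as the commands earlier in this section):
# Health endpoint status code (expect 200)
curl -s -o /dev/null -w "%{http_code}\n" https://api.gotofu.com/health
# Restart count per BonsAPI pod (expect 0 or low)
kubectl get pods -l app=bonsapi \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'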
Webapp (Frontend) #
Check pod status:
kubectl get pods -l app=webapp
Check health:
# Production
curl -I https://app.gotofu.com/
# Should return: HTTP/2 200
Check logs:
kubectl logs -l app=webapp --tail=50
Health indicators:
- ✅ All pods Running
- ✅ HTTP 200 from homepage
- ✅ No build errors in logs
- ✅ Resource usage normal
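A scriptable variant of the homepage check, plus a quick scan for the build-error indicator:
# Expect 200 (a 3xx is also fine if the homepage redirects to a login page)
curl -s -o /dev/null -w "%{http_code}\n" https://app.gotofu.com/
# Prints matching log lines if any; otherwise confirms there are no recent errors
kubectl logs -l app=webapp --tail=200 | grep -iE "error|failed" || echo "✅ No recent errors"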
Worker Services #
Worker services don’t have HTTP endpoints. Check via pod status and logs.
Invoice Processor:
# Pod status
kubectl get pods -l app=bonsai-invoice
# Check if processing messages
kubectl logs -l app=bonsai-invoice --tail=20
# Should see: "Processing message..." or similar
Knowledge Service:
kubectl get pods -l app=bonsai-knowledge
kubectl logs -l app=bonsai-knowledge --tail=20
Document Converter:
kubectl get pods -l app=bonsai-doc-convert
kubectl logs -l app=bonsai-doc-convert --tail=20
Health indicators for workers:
- ✅ At least 1 pod Running
- ✅ Logs show active message processing
- ✅ No connection errors in logs
- ✅ KEDA scaling working (check ScaledObject)
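To verify the KEDA indicator, inspect the ScaledObjects directly (the per-worker ScaledObject name below is an assumption; use the list command to find the real names):
# READY/ACTIVE columns show whether KEDA can reach its trigger and is currently scaling
kubectl get scaledobjects
# Detailed view of one worker's scaler, including recent scaling events (name assumed)
kubectl describe scaledobject bonsai-invoice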
Dependency Health Checks #
Database (PostgreSQL/RDS) #
Check connection from API pod:
kubectl exec -it <bonsapi-pod> -- nc -zv <db-host> 5432
Via AWS CLI:
# Check RDS instance status
aws rds describe-db-instances \
--db-instance-identifier bonsai-prod-db \
--query 'DBInstances[0].DBInstanceStatus'
# Should return: "available"
# Check connection count
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=bonsai-prod-db \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average
Health indicators:
- ✅ Instance status: available
- ✅ Connections < max_connections
- ✅ CPU usage < 80%
- ✅ Free memory > 20%
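To put numbers behind the CPU and memory indicators above, the CloudWatch pattern used for connections works for any standard AWS/RDS metric:
# Average CPU and freeable memory over the last hour
for metric in CPUUtilization FreeableMemory; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name "$metric" \
    --dimensions Name=DBInstanceIdentifier,Value=bonsai-prod-db \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 3600 \
    --statistics Average
done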
Redis (ElastiCache) #
Check connection from API pod:
kubectl exec -it <bonsapi-pod> -- nc -zv <redis-host> 6379
Via AWS CLI:
# Check cluster status
aws elasticache describe-cache-clusters \
--cache-cluster-id bonsai-redis-prod \
--query 'CacheClusters[0].CacheClusterStatus'
Health indicators:
- ✅ Cluster status: available
- ✅ Memory usage < 80%
- ✅ CPU usage < 80%
- ✅ Evictions = 0 or low
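The memory, CPU, and eviction indicators map to standard AWS/ElastiCache CloudWatch metrics. A sketch using the cluster id from above (for multi-node clusters the per-node id may carry a numeric suffix such as -001):
# Average values over the last hour for the key Redis metrics
for metric in CPUUtilization DatabaseMemoryUsagePercentage Evictions; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ElastiCache \
    --metric-name "$metric" \
    --dimensions Name=CacheClusterId,Value=bonsai-redis-prod \
    --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 3600 \
    --statistics Average
done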
RabbitMQ (Amazon MQ) #
Check broker status:
# Via AWS CLI
aws mq describe-broker --broker-id <broker-id>
# Check management console
# https://<rabbitmq-host>:15671
Check queue health:
See RabbitMQ Management runbook for detailed checks.
Health indicators:
- ✅ Broker status: RUNNING
- ✅ Queue depths < threshold
- ✅ Consumers connected
- ✅ Message rates normal
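For a scripted check of the first indicator, the broker state can be queried directly (same broker id placeholder as above):
# Prints just the broker state; expect RUNNING
aws mq describe-broker --broker-id <broker-id> --query 'BrokerState' --output text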
S3 Storage #
Test S3 access from API pod:
kubectl exec -it <bonsapi-pod> -- sh -c "
aws s3 ls s3://bonsai-documents-prod/ | head -n 1
"
Check bucket:
# Via AWS CLI
aws s3 ls s3://bonsai-documents-prod/
# Check total object count and size (lists every object, so this can be slow on large buckets)
aws s3 ls s3://bonsai-documents-prod/ --recursive --summarize --human-readable
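On large buckets the recursive listing above can be slow; CloudWatch's daily storage metric is a cheaper way to track bucket size (standard AWS/S3 metric, reported roughly once per day):
# Bucket size in bytes over the last two days (Standard storage class)
aws cloudwatch get-metric-statistics \
  --namespace AWS/S3 \
  --metric-name BucketSizeBytes \
  --dimensions Name=BucketName,Value=bonsai-documents-prod Name=StorageType,Value=StandardStorage \
  --start-time $(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average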
Health indicators:
- ✅ Bucket accessible
- ✅ Storage usage within limits
- ✅ No 403/404 errors
System-Wide Health Checks #
Kubernetes Cluster #
Cluster info:
# Basic cluster health
kubectl cluster-info
# Node status
kubectl get nodes -o wide
# Component status (deprecated since Kubernetes v1.19; see the API server health checks below)
kubectl get componentstatuses
Health indicators:
- ✅ All nodes Ready
- ✅ Control plane components healthy
- ✅ API server responsive
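On recent Kubernetes versions the API server's own health endpoints are the preferred check (componentstatuses is deprecated):
# Aggregated control-plane readiness, with per-check detail
kubectl get --raw='/readyz?verbose'
# Liveness of the API server itself
kubectl get --raw='/livez?verbose'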
Node Health #
Check node resources:
# Node resource usage
kubectl top nodes
# Detailed node info
kubectl describe nodes
# Check for pressure conditions
kubectl describe nodes | grep -A 5 "Conditions:"
Health indicators:
- ✅ CPU usage < 80%
- ✅ Memory usage < 80%
- ✅ Disk usage < 85%
- ✅ No pressure conditions (MemoryPressure, DiskPressure, PIDPressure)
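A compact way to confirm the pressure conditions across all nodes (everything except Ready should report False):
# One block per node: condition type and status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[*]}{"  "}{.type}={.status}{"\n"}{end}{end}'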
Ingress/Load Balancer #
Check ingress status:
# List ingresses
kubectl get ingress
# Describe ingress
kubectl describe ingress bonsapi-ingress
# Check ALB health
aws elbv2 describe-target-health \
--target-group-arn <target-group-arn>
Health indicators:
- ✅ Ingress has address/hostname
- ✅ All targets healthy in ALB
- ✅ SSL certificates valid
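To check the certificate-validity indicator from a workstation (assumes openssl is installed locally):
# Print the certificate expiry date for each public endpoint
for host in api.gotofu.com app.gotofu.com; do
  echo -n "$host: "
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null | openssl x509 -noout -enddate
done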
Monitoring Dashboards #
Datadog Dashboard #
Key metrics to monitor:
- API Response Time - P50, P95, P99 latency
- Error Rate - 4xx and 5xx errors
- Request Volume - Requests per second
- Pod Health - Running pods vs desired
- Resource Usage - CPU, memory, disk
- Queue Depth - RabbitMQ queue lengths
- Database Performance - Query time, connections
CloudWatch Dashboard #
Access via AWS Console → CloudWatch → Dashboards
Key widgets:
- EKS cluster metrics
- RDS performance
- ElastiCache metrics
- ALB metrics
- Log insights queries
Automated Health Checks #
Kubernetes Liveness Probes #
BonsAI services have liveness probes configured:
# Example from BonsAPI deployment
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
Check probe configuration:
kubectl describe pod <pod-name> | grep -A 10 "Liveness\|Readiness"
Kubernetes Readiness Probes #
Readiness probes determine if pod should receive traffic:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
If pod is not ready:
# Check why pod is not ready
kubectl describe pod <pod-name>
# Check logs
kubectl logs <pod-name>
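To see the Ready condition for every pod behind a service at a glance (label selector as in the earlier examples):
# Pod name and Ready status; pods reporting False receive no traffic
kubectl get pods -l app=bonsapi \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'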
Health Check Schedule #
Daily (Automated) #
- Monitoring alerts
- Kubernetes health checks
- Datadog dashboards
Weekly (Manual) #
- Review error rates and trends
- Check resource usage trends
- Review scaling patterns
- Check backup status
Monthly (Manual) #
- Capacity planning review
- Cost optimization review
- Security updates check
- Performance optimization review
Troubleshooting Unhealthy Services #
Pod Not Ready #
Investigation:
# Why is pod not ready?
kubectl describe pod <pod-name>
# Check logs
kubectl logs <pod-name>
# Check events
kubectl get events --field-selector involvedObject.name=<pod-name>
Common causes:
- Failing readiness probe
- Slow startup time
- Database connection issues
- Missing dependencies
High Resource Usage #
Investigation:
# Check resource usage
kubectl top pod <pod-name>
# Check limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"
# Check for OOM kills / restart reasons (CPU throttling itself is only visible in container metrics, e.g. Datadog)
kubectl describe pod <pod-name> | grep -A 3 "Last State"
Solutions:
- Increase resource limits
- Optimize code
- Scale horizontally
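As a stop-gap while the root cause is investigated, a manual horizontal scale-up looks like this (deployment name assumed; KEDA/HPA will reconcile the replica count for autoscaled services, and permanent changes belong in the deployment pipeline):
# Temporarily add replicas (deployment name assumed)
kubectl scale deployment bonsapi --replicas=4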
Service Degradation #
Investigation:
- Check metrics in Datadog
- Review error logs
- Check dependency health
- Review recent changes
See: Incident Response for detailed procedures
Pre-Deployment Health Check #
Before deploying to production:
# 1. Verify current system health
./health-check.sh
# 2. Check resource capacity
kubectl top nodes
kubectl top pods
# 3. Verify database status
aws rds describe-db-instances --db-instance-identifier bonsai-prod-db
# 4. Check queue depths
# Via RabbitMQ management console
# 5. Review recent errors
# Via Datadog or CloudWatch
# 6. Verify backups are recent (5 most recent snapshots, newest first)
aws rds describe-db-snapshots --db-instance-identifier bonsai-prod-db \
  --query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[:5].[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \
  --output table
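Steps 4 and 5 are manual; if credentials for the RabbitMQ management API are available, queue depths can also be pulled from the command line (user/password variables are placeholders and jq is assumed to be installed; see the RabbitMQ Management runbook):
# Name and message count for every queue
curl -s -u "$RABBITMQ_USER:$RABBITMQ_PASS" "https://<rabbitmq-host>:15671/api/queues" \
  | jq -r '.[] | "\(.name)\t\(.messages)"'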
See Also #
- Monitoring & Alerting - Setting up alerts
- Kubernetes Debugging - Troubleshooting pods
- Incident Response - Responding to health issues
- Deployment Monitoring - Post-deployment checks
- Database Access - Database health checks