Deployment & CI/CD Monitoring #
This runbook covers how to monitor deployments, troubleshoot CI/CD pipelines, and verify application health after releases.
When to Use This Runbook #
- Monitoring ongoing deployments
- Troubleshooting failed CI/CD runs
- Verifying deployment success
- Rolling back failed deployments
- Investigating deployment-related incidents
Deployment Architecture #
BonsAI uses GitHub Actions for CI/CD with the following workflow:
Code Push → GitHub Actions → Build Images → Push to ECR → Deploy to EKS
Deployment Workflow:
- CI checks - Linting, tests, type checking
- Build - Docker image build via Depot
- Push - Images pushed to Amazon ECR
- Sync Secrets - Doppler → Kubernetes secrets
- Database Migration - Atlas migrations
- Deploy Services - Rolling deployment to EKS
GitHub Actions Workflows #
Main Workflows #
| Workflow | Trigger | Purpose | File |
|---|---|---|---|
| CI | PR to main, push to main | Linting, testing, type checking | .github/workflows/⚡️ ci.yml |
| Deploy | Workflow call | Build and deploy to EKS | .github/workflows/deploy.yaml |
| Dev Deploy | Push to main | Auto-deploy to dev | .github/workflows/dev-deploy.yaml |
| Release Notes | PR merge to main | Generate release notes | .github/workflows/release-notes.yml |
| Release Tags | Hasami PR merge | Create version tags | .github/workflows/release-tags.yml |
Accessing GitHub Actions #
- Go to GitHub Repository
- Click Actions tab
- Select workflow from left sidebar
- Click individual run to see details
Monitoring Deployments #
Check Active Deployment #
# View GitHub Actions via CLI (optional)
gh run list --workflow="deploy.yaml" --limit 5
# Or view in browser
# https://github.com/tofu2-limited/bonsai/actions
In GitHub Actions UI:
- Click on running workflow
- View job progress:
- sync-secrets - Syncing Doppler to K8s
- database-migration - Running database migrations
- bonsapi - Deploying backend API
- webapp - Deploying frontend
- bonsai-invoice - Deploying invoice processor
- bonsai-knowledge - Deploying knowledge service
- And other services…
Watch Deployment in Kubernetes #
# Watch deployment rollout
kubectl rollout status deployment/bonsapi-deployment
# Watch pods being created
kubectl get pods --watch -l app=bonsapi
# Check deployment events
kubectl get events --sort-by='.lastTimestamp' | grep bonsapi
Verify Deployment Success #
Step 1: Check Pod Status
# All pods should be Running
kubectl get pods
# Check specific deployment
kubectl get pods -l app=bonsapi
# Expected output:
NAME READY STATUS RESTARTS AGE
bonsapi-abc123 1/1 Running 0 2m
bonsapi-def456 1/1 Running 0 2m
Step 2: Check Service Health
# Check if service endpoints are ready
kubectl get endpoints bonsapi-service
# Port-forward and test
kubectl port-forward service/bonsapi-service 8080:8080
curl http://localhost:8080/health
Step 3: Check Application Logs
# View recent logs
kubectl logs -l app=bonsapi --tail=50
# Follow logs in real-time
kubectl logs -f deployment/bonsapi-deployment
Step 4: Check External Access
# For dev environment
curl https://api-dev.gotofu.com/health
# For production
curl https://api.gotofu.com/health
Step 5: Monitor Error Rates
- Check Datadog for error rate spikes
- Review Sentry for new exceptions
- Check CloudWatch metrics
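A quick way to make the error-rate check concrete is to compare a post-deploy error count against the pre-deploy baseline. This is a minimal sketch: the 20% threshold is an assumption, and how you obtain the counts (Datadog API, CloudWatch, Sentry) depends on your monitoring setup.

```shell
# Flag a regression when the post-deploy error count exceeds the baseline
# by more than a threshold percentage (default 20%, an assumed value).
error_rate_regression() {
  local baseline=$1 current=$2 threshold_pct=${3:-20}
  # Percentage increase relative to baseline (integer arithmetic)
  local increase=$(( (current - baseline) * 100 / baseline ))
  if (( increase > threshold_pct )); then
    echo "REGRESSION: errors up ${increase}% (baseline=${baseline}, current=${current})"
    return 1
  fi
  echo "OK: errors changed by ${increase}%"
}

# Example: 50 errors/5min before deploy, 55 after
error_rate_regression 50 55   # → OK: errors changed by 10%
```

Feed it whatever counts your monitoring backend exposes; the exit code makes it usable as a gate in a post-deploy verification step.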
Troubleshooting CI/CD Failures #
CI Check Failures #
Common Issues:
- Linting Errors
  Error: Linter found issues
  Solution: Fix linting errors locally:
  mise run lint
  mise run lint:fix
- Type Errors
  Error: Type check failed
  Solution: Fix type errors:
  # TypeScript
  cd apps/webapp && pnpm typecheck
  # Rust
  cargo check
- Test Failures
  Error: Tests failed
  Solution: Run tests locally:
  mise run test
Build Failures #
Common Issues:
- Docker Build Failure
  Error: failed to solve with frontend dockerfile.v0
  Investigation:
  - Check Dockerfile syntax
  - Verify base image exists
  - Check build context
  Solution:
  # Test build locally
  docker build -f apps/bonsapi/Dockerfile .
- ECR Push Failure
  Error: failed to push image to ECR
  Investigation:
  - Check AWS credentials
  - Verify ECR repository exists
  - Check IAM permissions
  Solution:
  # Verify ECR access
  aws ecr describe-repositories --repository-names bonsapi
  # Login to ECR
  aws ecr get-login-password --region eu-central-1 | \
    docker login --username AWS --password-stdin <account-id>.dkr.ecr.eu-central-1.amazonaws.com
Deployment Failures #
Common Issues:
- Secret Sync Failure
  Error: Failed to sync secrets from Doppler
  Investigation:
  - Check Doppler service token
  - Verify External Secrets Operator status
  Solution: See Secrets Management
- Database Migration Failure
  Error: Migration failed
  Investigation:
  # Check migration logs in GitHub Actions
  # Or check migration pod logs
  kubectl logs -l job-name=database-migration
  Solution:
  - Review migration SQL
  - Check database connectivity
  - Verify migration hasn’t been partially applied
- Pod Startup Failure
  Error: Pods not reaching Ready state
  Investigation:
  # Check pod status
  kubectl describe pod <pod-name>
  # Check logs
  kubectl logs <pod-name>
  Common causes:
  - Database connection failure
  - Missing environment variables
  - Health check failures
  - Resource limits too low
  Solution: See Kubernetes Debugging
- Image Pull Error
  Error: Failed to pull image
  Investigation:
  kubectl describe pod <pod-name> | grep -A 5 "Failed"
  Causes:
  - Image doesn’t exist in ECR
  - Wrong image tag
  - ECR permissions issue
  Solution:
  # Verify image exists
  aws ecr describe-images \
    --repository-name bonsapi \
    --image-ids imageTag=<tag>
Rollback Procedures #
Rolling Back via Kubernetes #
Quick Rollback:
# Rollback to previous version
kubectl rollout undo deployment/bonsapi-deployment
# Monitor rollback
kubectl rollout status deployment/bonsapi-deployment
# Verify pods are healthy
kubectl get pods -l app=bonsapi
Rollback to Specific Version:
# View deployment history
kubectl rollout history deployment/bonsapi-deployment
# Rollback to specific revision
kubectl rollout undo deployment/bonsapi-deployment --to-revision=3
# Verify rollback
kubectl rollout status deployment/bonsapi-deployment
Rolling Back via GitHub Actions #
Re-deploy Previous Version:
- Find the last successful deployment run
- Click Re-run jobs in GitHub Actions
- Monitor deployment progress
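Finding the last successful run can be scripted with the GitHub CLI. This is a sketch: `latest_success_id` is a hypothetical helper, and the JSON shape matches `gh run list --json databaseId,conclusion`; the live invocation is shown in the trailing comments.

```shell
# Pick the newest successful run ID from `gh run list` JSON output
# (gh returns runs newest-first).
latest_success_id() {
  python3 -c '
import json, sys
runs = json.load(sys.stdin)            # list ordered newest-first
ok = [r["databaseId"] for r in runs if r.get("conclusion") == "success"]
print(ok[0] if ok else "")
'
}

# Demo on sample data shaped like the gh output:
printf '[{"databaseId": 3, "conclusion": "failure"},
         {"databaseId": 2, "conclusion": "success"}]' | latest_success_id
# → 2

# Live usage:
#   gh run list --workflow="deploy.yaml" --limit 20 \
#     --json databaseId,conclusion | latest_success_id
#   gh run rerun <run-id>
```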
Manual Rollback:
# Tag specific image as latest
aws ecr put-image \
--repository-name bonsapi \
--image-tag latest \
--image-manifest "$(aws ecr batch-get-image --repository-name bonsapi --image-ids imageTag=<previous-tag> --query 'images[].imageManifest' --output text)"
# Force pod restart
kubectl rollout restart deployment/bonsapi-deployment
Database Rollback #
IMPORTANT: Database rollbacks are risky. Migrations are forward-only.
If you must rollback database:
- Assess migration impact
  - What data changes were made?
  - Are they reversible?
  - Will rollback cause data loss?
- Create rollback migration
  # Create new migration that reverses changes
  cd apps/bonsapi/migrations
  # Edit migration files to undo changes
- Test in dev first
  mise run db:migrate  # in dev environment
- Deploy rollback migration via GitHub Actions
Manual Deployment #
When to Deploy Manually #
- Automated deployment failed
- Emergency hotfix needed
- Testing specific configuration
Manual Deployment Steps #
Prerequisites:
- AWS CLI configured
- kubectl configured
- Docker installed (for building locally)
Step 1: Build Image
# Set environment variables
export AWS_ACCOUNT_ID=<account-id>
export AWS_REGION=eu-central-1
export IMAGE_TAG=$(git rev-parse --short HEAD)
# Build image
docker build -t bonsapi:$IMAGE_TAG -f apps/bonsapi/Dockerfile .
# Tag for ECR
docker tag bonsapi:$IMAGE_TAG \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/bonsapi:$IMAGE_TAG
Step 2: Push to ECR
# Login to ECR
aws ecr get-login-password --region $AWS_REGION | \
docker login --username AWS --password-stdin \
$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Push image
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/bonsapi:$IMAGE_TAG
Step 3: Update Kubernetes Deployment
# Update image in deployment
kubectl set image deployment/bonsapi-deployment \
bonsapi=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/bonsapi:$IMAGE_TAG
# Monitor rollout
kubectl rollout status deployment/bonsapi-deployment
# Verify
kubectl get pods -l app=bonsapi
Deployment Best Practices #
Pre-Deployment Checklist #
- All tests passing locally
- Code reviewed and approved
- Feature flags configured (if applicable)
- Database migrations tested in dev
- Monitoring and alerts configured
- Rollback plan documented
- Team notified of deployment
During Deployment #
- Monitor GitHub Actions progress
- Watch pod rollout in Kubernetes
- Check application logs for errors
- Verify health checks passing
- Monitor error rates in Datadog
- Test critical user flows
Post-Deployment #
- Verify all services are healthy
- Check error rates returned to baseline
- Review Sentry for new exceptions
- Test API endpoints
- Monitor performance metrics
- Update team on deployment status
- Document any issues encountered
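The "test API endpoints" item above can be partly automated with a smoke check against the health endpoint. A minimal sketch, assuming the endpoint returns a body like `{"status": "ok"}` — match the pattern to what `/health` actually returns.

```shell
# Succeed only if stdin contains a {"status": "ok"} health response.
healthy() {
  grep -q '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Demo on a sample response body:
if printf '{"status": "ok", "version": "1.2.3"}' | healthy; then
  echo "service healthy"
fi
# → service healthy

# Live usage:
#   curl -fsS https://api.gotofu.com/health | healthy || echo "ALERT: unhealthy"
```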
Deployment Timing #
- Avoid deployments during:
  - Business hours (for production)
  - End of week/month (accounting periods)
  - Major events or promotions
- Best times to deploy:
  - Early morning (before business hours)
  - Low-traffic periods
  - After thorough testing in dev
Monitoring Deployment Health #
Key Metrics to Watch #
- Error Rate
  - Should remain stable after deployment
  - Spike indicates deployment issues
- Response Time
  - P50, P95, P99 latency
  - Degradation indicates performance issues
- Request Volume
  - Should match expected traffic patterns
  - Drop indicates service unavailability
- Pod Restarts
  - Should be zero or minimal
  - Frequent restarts indicate instability
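Pod restarts can be totalled straight from `kubectl get pods` output. The helper below is a sketch; column 4 matches kubectl's default RESTARTS column.

```shell
# Sum the RESTARTS column of `kubectl get pods` output (skipping the header).
total_restarts() {
  awk 'NR > 1 {sum += $4} END {print sum + 0}'
}

# Demo on sample output:
total_restarts <<'EOF'
NAME             READY   STATUS    RESTARTS   AGE
bonsapi-abc123   1/1     Running   0          2m
bonsapi-def456   1/1     Running   3          2m
EOF
# → 3

# Live usage:
#   kubectl get pods -l app=bonsapi | total_restarts
```

A non-zero and growing total shortly after a rollout is the instability signal described above.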
Health Check Endpoints #
# BonsAPI health
curl https://api.gotofu.com/health
# Webapp health (if available)
curl https://app.gotofu.com/api/health
# Individual pod health
kubectl port-forward <pod-name> 8080:8080
curl http://localhost:8080/health
Troubleshooting Deployment Performance #
Slow Rollout #
Symptoms: Deployment taking longer than expected
Investigation:
# Check pod events
kubectl describe deployment bonsapi-deployment
# Check pod scheduling
kubectl get pods -o wide
Common Causes:
- Image pull time (large images)
- Resource constraints (CPU/memory)
- Health check delays
- Node capacity issues
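For the "image pull time" cause, checking image sizes in ECR is a quick diagnostic. The conversion helper is a sketch; the live `aws ecr` call is shown in the comments.

```shell
# Convert a byte count on stdin to MiB for readability.
to_mib() {
  awk '{printf "%.1f MiB\n", $1 / (1024 * 1024)}'
}

# Demo:
echo 734003200 | to_mib
# → 700.0 MiB

# Live usage -- list image sizes for the repository:
#   aws ecr describe-images --repository-name bonsapi \
#     --query 'imageDetails[].imageSizeInBytes' --output text | tr '\t' '\n' | to_mib
```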
Failed Health Checks #
Symptoms: Pods marked as unhealthy during rollout
Investigation:
# Check pod health
kubectl describe pod <pod-name> | grep -A 10 "Readiness\|Liveness"
# Test health endpoint
kubectl exec <pod-name> -- curl http://localhost:8080/health
Solutions:
- Increase probe initial delay
- Fix health check endpoint
- Check dependencies (DB, Redis, RabbitMQ)
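Increasing the probe initial delay can be done with a JSON patch. A sketch, assuming the container is the first in the pod spec and 30 seconds is enough; adjust the path and value to your deployment.

```shell
# JSON patch raising the readiness probe's initialDelaySeconds to 30
# (container index 0 and the 30s value are assumptions).
PATCH='[{"op": "replace",
         "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds",
         "value": 30}]'

# Sanity-check the patch is valid JSON before applying:
printf '%s' "$PATCH" | python3 -m json.tool > /dev/null && echo "patch ok"
# → patch ok

# Live usage:
#   kubectl patch deployment bonsapi-deployment --type=json -p "$PATCH"
```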
Emergency Procedures #
Stop Deployment #
# Pause deployment
kubectl rollout pause deployment/bonsapi-deployment
# Investigate issue
# ...
# Resume or undo
kubectl rollout resume deployment/bonsapi-deployment
# OR
kubectl rollout undo deployment/bonsapi-deployment
Emergency Hotfix #
For critical production issues:
- Create hotfix branch from production
- Make minimal changes to fix issue
- Test thoroughly in dev
- Fast-track PR review
- Deploy immediately
- Monitor closely
- Document incident
# Create hotfix branch
git checkout -b hotfix/critical-issue
# Make fix
# ...
# Create PR
gh pr create --title "HOTFIX: Critical issue" --body "Description"
# After approval, merge and deploy
# GitHub Actions will auto-deploy to dev, then manual approve for prod
See Also #
- Kubernetes Debugging - Troubleshooting deployments
- Secrets Management - Managing environment variables
- Monitoring & Alerting - Tracking deployment health
- Rollback Procedures - Detailed rollback guide
- Incident Response - Managing deployment incidents
- CI/CD Troubleshooting - Advanced CI/CD debugging