Deployment & CI/CD Monitoring

This runbook covers how to monitor deployments, troubleshoot CI/CD pipelines, and verify application health after releases.

When to Use This Runbook #

  • Monitoring ongoing deployments
  • Troubleshooting failed CI/CD runs
  • Verifying deployment success
  • Rolling back failed deployments
  • Investigating deployment-related incidents

Deployment Architecture #

BonsAI uses GitHub Actions for CI/CD with the following workflow:

Code Push → GitHub Actions → Build Images → Push to ECR → Deploy to EKS

Deployment Workflow:

  1. CI checks - Linting, tests, type checking
  2. Build - Docker image build via Depot
  3. Push - Images pushed to Amazon ECR
  4. Sync Secrets - Doppler → Kubernetes secrets
  5. Database Migration - Atlas migrations
  6. Deploy Services - Rolling deployment to EKS

GitHub Actions Workflows #

Main Workflows #

Workflow       | Trigger                     | Purpose                         | File
CI             | PR to main, push to main    | Linting, testing, type checking | .github/workflows/⚡️ ci.yml
Deploy         | Workflow call               | Build and deploy to EKS         | .github/workflows/deploy.yaml
Dev Deploy     | Push to main                | Auto-deploy to dev              | .github/workflows/dev-deploy.yaml
Release Notes  | PR merge to main            | Generate release notes          | .github/workflows/release-notes.yml
Release Tags   | Hasami PR merge             | Create version tags             | .github/workflows/release-tags.yml

Accessing GitHub Actions #

  1. Go to GitHub Repository
  2. Click Actions tab
  3. Select workflow from left sidebar
  4. Click individual run to see details

Monitoring Deployments #

Check Active Deployment #

# View GitHub Actions via CLI (optional)
gh run list --workflow="deploy.yaml" --limit 5

# Or view in browser
# https://github.com/tofu2-limited/bonsai/actions
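To follow the most recent deploy run from a terminal rather than the browser, a sketch like the following may help. The function names `latest_deploy_run_id` and `watch_latest_deploy` are made up for illustration, and the sketch assumes the GitHub CLI is already authenticated against the repository.

```shell
# Hypothetical helper: print the numeric ID of the most recent deploy.yaml run.
latest_deploy_run_id() {
  gh run list --workflow="deploy.yaml" --limit 1 \
    --json databaseId --jq '.[0].databaseId'
}

# Hypothetical helper: block until that run finishes.
# Returns non-zero if no run was found or if the run failed.
watch_latest_deploy() {
  local run_id
  run_id="$(latest_deploy_run_id)"
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "No deploy.yaml runs found" >&2
    return 1
  fi
  # --exit-status makes gh exit non-zero when the run itself fails
  gh run watch "$run_id" --exit-status
}
```

Running `watch_latest_deploy` after sourcing these functions streams job progress until the run completes.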

In GitHub Actions UI:

  1. Click on running workflow
  2. View job progress:
    • sync-secrets - Syncing Doppler to K8s
    • database-migration - Running database migrations
    • bonsapi - Deploying backend API
    • webapp - Deploying frontend
    • bonsai-invoice - Deploying invoice processor
    • bonsai-knowledge - Deploying knowledge service
    • And other services…

Watch Deployment in Kubernetes #

# Watch deployment rollout
kubectl rollout status deployment/bonsapi-deployment

# Watch pods being created
kubectl get pods --watch -l app=bonsapi

# Check deployment events
kubectl get events --sort-by='.lastTimestamp' | grep bonsapi

Verify Deployment Success #

Step 1: Check Pod Status

# All pods should be Running
kubectl get pods

# Check specific deployment
kubectl get pods -l app=bonsapi

# Expected output:
NAME                       READY   STATUS    RESTARTS   AGE
bonsapi-abc123            1/1     Running   0          2m
bonsapi-def456            1/1     Running   0          2m
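Reading the READY and STATUS columns by eye gets tedious with many replicas. As a rough sketch, the check can be scripted; `unready_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: list pods for an app label that are not fully ready.
# Prints one pod name per line; prints nothing when every listed pod is healthy.
# Usage: unready_pods <app-label>
unready_pods() {
  kubectl get pods -l "app=$1" --no-headers |
    awk '{
      split($2, ready, "/")   # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }'
}
```

`unready_pods bonsapi` printing nothing means every listed pod is Running and fully ready. It also prints nothing when no pods match the label, so it is worth confirming the pod count separately.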

Step 2: Check Service Health

# Check if service endpoints are ready
kubectl get endpoints bonsapi-service

# Port-forward in one terminal (the command blocks while forwarding)
kubectl port-forward service/bonsapi-service 8080:8080

# Then test from a second terminal
curl http://localhost:8080/health
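The endpoints check can also be turned into a pass/fail test, which is handy in scripts. This is only a sketch: `service_has_endpoints` is a hypothetical helper, and it treats the Service as healthy when at least one ready address is listed.

```shell
# Hypothetical helper: succeed only if the Service has at least one ready endpoint IP.
# Usage: service_has_endpoints <service-name>
service_has_endpoints() {
  local ips
  # .addresses holds ready endpoints only; not-ready ones live under .notReadyAddresses
  ips="$(kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}')"
  [ -n "$ips" ]
}
```

For example: `service_has_endpoints bonsapi-service || echo "no ready endpoints"`.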

Step 3: Check Application Logs

# View recent logs
kubectl logs -l app=bonsapi --tail=50

# Follow logs in real-time
kubectl logs -f deployment/bonsapi-deployment

Step 4: Check External Access

# For dev environment
curl https://api-dev.gotofu.com/health

# For production
curl https://api.gotofu.com/health
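Right after a rollout, a single curl can fail while new pods are still warming up. The sketch below retries the health check a few times before giving up. `wait_for_health` and its default of 10 attempts spaced 5 seconds apart are illustrative assumptions, not values taken from the pipeline.

```shell
# Hypothetical helper: poll a health URL until it returns a successful HTTP status.
# Usage: wait_for_health <url> [attempts] [delay-seconds]
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "healthy after $i attempt(s): $url"
      return 0
    fi
    # Wait between attempts, but not after the final one
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts: $url" >&2
  return 1
}
```

For example: `wait_for_health https://api-dev.gotofu.com/health`.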

Step 5: Monitor Error Rates

  • Check Datadog for error rate spikes
  • Review Sentry for new exceptions
  • Check CloudWatch metrics

Troubleshooting CI/CD Failures #

CI Check Failures #

Common Issues:

  1. Linting Errors

    Error: Linter found issues
    

    Solution: Fix linting errors locally:

    mise run lint
    mise run lint:fix
    
  2. Type Errors

    Error: Type check failed
    

    Solution: Fix type errors:

    # TypeScript
    cd apps/webapp && pnpm typecheck
    
    # Rust
    cargo check
    
  3. Test Failures

    Error: Tests failed
    

    Solution: Run tests locally:

    mise run test
    

Build Failures #

Common Issues:

  1. Docker Build Failure

    Error: failed to solve with frontend dockerfile.v0
    

    Investigation:

    • Check Dockerfile syntax
    • Verify base image exists
    • Check build context

    Solution:

    # Test build locally
    docker build -f apps/bonsapi/Dockerfile .
    
  2. ECR Push Failure

    Error: failed to push image to ECR
    

    Investigation:

    • Check AWS credentials
    • Verify ECR repository exists
    • Check IAM permissions

    Solution:

    # Verify ECR access
    aws ecr describe-repositories --repository-names bonsapi
    
    # Login to ECR
    aws ecr get-login-password --region eu-central-1 | \
      docker login --username AWS --password-stdin <account-id>.dkr.ecr.eu-central-1.amazonaws.com
    

Deployment Failures #

Common Issues:

  1. Secret Sync Failure

    Error: Failed to sync secrets from Doppler
    

    Investigation:

    • Check Doppler service token
    • Verify External Secrets Operator status

    Solution: See Secrets Management

  2. Database Migration Failure

    Error: Migration failed
    

    Investigation:

    # Check migration logs in GitHub Actions
    # Or check migration pod logs
    kubectl logs -l job-name=database-migration
    

    Solution:

    • Review migration SQL
    • Check database connectivity
    • Verify migration hasn’t been partially applied
  3. Pod Startup Failure

    Error: Pods not reaching Ready state
    

    Investigation:

    # Check pod status
    kubectl describe pod <pod-name>
    
    # Check logs
    kubectl logs <pod-name>
    

    Common causes:

    • Database connection failure
    • Missing environment variables
    • Health check failures
    • Resource limits too low

    Solution: See Kubernetes Debugging

  4. Image Pull Error

    Error: Failed to pull image
    

    Investigation:

    kubectl describe pod <pod-name> | grep -A 5 "Failed"
    

    Causes:

    • Image doesn’t exist in ECR
    • Wrong image tag
    • ECR permissions issue

    Solution:

    # Verify image exists
    aws ecr describe-images \
      --repository-name bonsapi \
      --image-ids imageTag=<tag>
    

Rollback Procedures #

Rolling Back via Kubernetes #

Quick Rollback:

# Rollback to previous version
kubectl rollout undo deployment/bonsapi-deployment

# Monitor rollback
kubectl rollout status deployment/bonsapi-deployment

# Verify pods are healthy
kubectl get pods -l app=bonsapi
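The monitor-then-undo sequence can be combined so that a stalled rollout is reverted without manual intervention. This is a sketch under assumptions: `rollout_or_undo` is a hypothetical helper, and the 300-second default timeout is an arbitrary choice that should be tuned to how long the service normally takes to roll out.

```shell
# Hypothetical helper: wait for a rollout and undo it if it does not complete in time.
# Usage: rollout_or_undo <deployment> [timeout]
rollout_or_undo() {
  local deployment="$1" timeout="${2:-300s}"
  # rollout status exits non-zero if the rollout has not finished within --timeout
  if kubectl rollout status "deployment/$deployment" --timeout="$timeout"; then
    echo "rollout of $deployment completed"
    return 0
  fi
  echo "rollout of $deployment did not complete within $timeout; rolling back" >&2
  kubectl rollout undo "deployment/$deployment"
  kubectl rollout status "deployment/$deployment" --timeout="$timeout"
  # Always report failure here so callers know the new version did not go out
  return 1
}
```

For example: `rollout_or_undo bonsapi-deployment 180s`.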

Rollback to Specific Version:

# View deployment history
kubectl rollout history deployment/bonsapi-deployment

# Rollback to specific revision
kubectl rollout undo deployment/bonsapi-deployment --to-revision=3

# Verify rollback
kubectl rollout status deployment/bonsapi-deployment
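Before undoing to a numbered revision, it helps to confirm which image that revision actually ran. The sketch below extracts it from `kubectl rollout history --revision`. `revision_images` is a hypothetical helper, and it relies on the `Image:` lines that the command prints for the pod template.

```shell
# Hypothetical helper: print the container image(s) recorded for one revision.
# Usage: revision_images <deployment> <revision>
revision_images() {
  kubectl rollout history "deployment/$1" --revision="$2" |
    awk '$1 == "Image:" { print $2 }'
}
```

For example, `revision_images bonsapi-deployment 3` shows the image tag that `--to-revision=3` would restore.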

Rolling Back via GitHub Actions #

Re-deploy Previous Version:

  1. Find the last successful deployment run
  2. Click Re-run jobs in GitHub Actions
  3. Monitor deployment progress

Manual Rollback:

# Tag specific image as latest
aws ecr put-image \
  --repository-name bonsapi \
  --image-tag latest \
  --image-manifest "$(aws ecr batch-get-image --repository-name bonsapi --image-ids imageTag=<previous-tag> --query 'images[].imageManifest' --output text)"

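# Force pod restart (this only picks up the retagged image if the deployment
# references the :latest tag and pulls the image on every pod start)
kubectl rollout restart deployment/bonsapi-deployment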

Database Rollback #

IMPORTANT: Database rollbacks are risky. Migrations are forward-only.

If you must roll back the database:

  1. Assess migration impact

    • What data changes were made?
    • Are they reversible?
    • Will rollback cause data loss?
  2. Create rollback migration

    # Add a new migration that reverses the changes
    cd apps/bonsapi/migrations
    # Write the reversal as a new migration file; leave already-applied files untouched
    
  3. Test in dev first

    mise run db:migrate  # in dev environment
    
  4. Deploy rollback migration via GitHub Actions

Manual Deployment #

When to Deploy Manually #

  • Automated deployment failed
  • Emergency hotfix needed
  • Testing specific configuration

Manual Deployment Steps #

Prerequisites:

  • AWS CLI configured
  • kubectl configured
  • Docker installed (for building locally)

Step 1: Build Image

# Set environment variables
export AWS_ACCOUNT_ID=<account-id>
export AWS_REGION=eu-central-1
export IMAGE_TAG=$(git rev-parse --short HEAD)

# Build image
docker build -t bonsapi:$IMAGE_TAG -f apps/bonsapi/Dockerfile .

# Tag for ECR
docker tag bonsapi:$IMAGE_TAG \
  $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/bonsapi:$IMAGE_TAG

Step 2: Push to ECR

# Login to ECR
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin \
  $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# Push image
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/bonsapi:$IMAGE_TAG

Step 3: Update Kubernetes Deployment

# Update image in deployment
kubectl set image deployment/bonsapi-deployment \
  bonsapi=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/bonsapi:$IMAGE_TAG

# Monitor rollout
kubectl rollout status deployment/bonsapi-deployment

# Verify
kubectl get pods -l app=bonsapi
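The full image URI is repeated in the tag, push, and set-image commands, so a typo or an unset variable in any one of them deploys the wrong thing. One possible way to reduce that risk is to build the URI once and reuse it. `ecr_image_uri` is a hypothetical helper, not part of the existing tooling.

```shell
# Hypothetical helper: build the full ECR image URI from its parts.
# Fails if any part is missing or empty (for example, an unset variable).
# Usage: ecr_image_uri <account-id> <region> <repository> <tag>
ecr_image_uri() {
  local part
  if [ "$#" -ne 4 ]; then
    echo "usage: ecr_image_uri <account-id> <region> <repository> <tag>" >&2
    return 2
  fi
  for part in "$@"; do
    if [ -z "$part" ]; then
      echo "ecr_image_uri: empty argument" >&2
      return 2
    fi
  done
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}
```

For example: `IMAGE_URI="$(ecr_image_uri "$AWS_ACCOUNT_ID" "$AWS_REGION" bonsapi "$IMAGE_TAG")"`, then pass `"$IMAGE_URI"` to `docker tag`, `docker push`, and `kubectl set image`.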

Deployment Best Practices #

Pre-Deployment Checklist #

  • All tests passing locally
  • Code reviewed and approved
  • Feature flags configured (if applicable)
  • Database migrations tested in dev
  • Monitoring and alerts configured
  • Rollback plan documented
  • Team notified of deployment

During Deployment #

  • Monitor GitHub Actions progress
  • Watch pod rollout in Kubernetes
  • Check application logs for errors
  • Verify health checks are passing
  • Monitor error rates in Datadog
  • Test critical user flows

Post-Deployment #

  • Verify all services are healthy
  • Check that error rates have returned to baseline
  • Review Sentry for new exceptions
  • Test API endpoints
  • Monitor performance metrics
  • Update team on deployment status
  • Document any issues encountered

Deployment Timing #

  • Avoid deployments during:

    • Business hours (for production)
    • End of week/month (accounting periods)
    • Major events or promotions
  • Best times to deploy:

    • Early morning (before business hours)
    • Low-traffic periods
    • After thorough testing in dev

Monitoring Deployment Health #

Key Metrics to Watch #

  1. Error Rate

    • Should remain stable after deployment
    • Spike indicates deployment issues
  2. Response Time

    • P50, P95, P99 latency
    • Degradation indicates performance issues
  3. Request Volume

    • Should match expected traffic patterns
    • Drop indicates service unavailability
  4. Pod Restarts

    • Should be zero or minimal
    • Frequent restarts indicate instability
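Of the four metrics, pod restarts are the easiest to check directly from the cluster. The following sketch flags pods above a restart threshold. `restarting_pods` is a hypothetical helper, and it assumes RESTARTS is the fourth column of the default `kubectl get pods` output.

```shell
# Hypothetical helper: print pods whose RESTARTS count exceeds a threshold.
# Usage: restarting_pods <app-label> [max-restarts]   (threshold defaults to 0)
restarting_pods() {
  kubectl get pods -l "app=$1" --no-headers |
    awk -v max="${2:-0}" '$4 + 0 > max + 0 { print $1 ": " $4 " restarts" }'
}
```

For example, `restarting_pods bonsapi` lists every pod that has restarted at least once, and `restarting_pods bonsapi 3` lists only pods with more than three restarts.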

Health Check Endpoints #

# BonsAPI health
curl https://api.gotofu.com/health

# Webapp health (if available)
curl https://app.gotofu.com/api/health

# Individual pod health (run the port-forward in one terminal, curl in another)
kubectl port-forward <pod-name> 8080:8080
curl http://localhost:8080/health

Troubleshooting Deployment Performance #

Slow Rollout #

Symptoms: Deployment taking longer than expected

Investigation:

# Check pod events
kubectl describe deployment bonsapi-deployment

# Check pod scheduling
kubectl get pods -o wide

Common Causes:

  • Image pull time (large images)
  • Resource constraints (CPU/memory)
  • Health check delays
  • Node capacity issues

Failed Health Checks #

Symptoms: Pods marked as unhealthy during rollout

Investigation:

# Check pod health
kubectl describe pod <pod-name> | grep -A 10 "Readiness\|Liveness"

# Test health endpoint (requires curl to be present in the container image)
kubectl exec <pod-name> -- curl http://localhost:8080/health

Solutions:

  • Increase probe initial delay
  • Fix health check endpoint
  • Check dependencies (DB, Redis, RabbitMQ)
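As a stopgap for the first solution, the probe delay can be raised on the live Deployment with a JSON Patch. This is a sketch under assumptions: `readiness_delay_patch` is a hypothetical helper, the container index and delay value are illustrative, and the container must already define a readiness probe. A live patch is likely to be overwritten by the next pipeline deploy, so the lasting fix belongs in the deployment manifests.

```shell
# Hypothetical helper: build a JSON Patch that sets the readiness probe's
# initialDelaySeconds on one container of a Deployment's pod template.
# Usage: readiness_delay_patch <container-index> <seconds>
readiness_delay_patch() {
  # "add" sets the field whether or not it is already present
  printf '[{"op":"add","path":"/spec/template/spec/containers/%d/readinessProbe/initialDelaySeconds","value":%d}]\n' "$1" "$2"
}
```

For example: `kubectl patch deployment bonsapi-deployment --type=json -p "$(readiness_delay_patch 0 30)"`. Changing the pod template triggers a new rollout.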

Emergency Procedures #

Stop Deployment #

# Pause deployment
kubectl rollout pause deployment/bonsapi-deployment

# Investigate issue
# ...

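# Resume to let the rollout continue
kubectl rollout resume deployment/bonsapi-deployment
# OR roll back; a paused deployment must be resumed before it can be undone
kubectl rollout resume deployment/bonsapi-deployment
kubectl rollout undo deployment/bonsapi-deployment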

Emergency Hotfix #

For critical production issues:

  1. Create hotfix branch from production
  2. Make minimal changes to fix issue
  3. Test thoroughly in dev
  4. Fast-track PR review
  5. Deploy immediately
  6. Monitor closely
  7. Document incident

# Create hotfix branch (first check out the branch or tag that production runs)
git checkout -b hotfix/critical-issue

# Make fix
# ...

# Create PR
gh pr create --title "HOTFIX: Critical issue" --body "Description"

# After approval, merge and deploy
# GitHub Actions auto-deploys to dev; the prod deployment then requires manual approval

See Also #