Kubernetes Debugging #
This runbook covers troubleshooting and debugging applications running on AWS EKS (Elastic Kubernetes Service) using kubectl and related tools.
When to Use This Runbook #
- Pod crashes or restart loops
- Service unavailability or connection issues
- Deployment failures or rollout problems
- Resource exhaustion (CPU, memory)
- Container startup failures
- Investigating production incidents
Prerequisites #
Required Tools #
- AWS CLI (v2.19.4+)
- kubectl (compatible with EKS version)
- Configured AWS SSO profile
AWS SSO Setup #
Step 1: Configure AWS CLI with SSO
aws configure sso
Provide the following information:
- SSO session name: `bonsai` (or any name you prefer)
- SSO start URL: `https://tofu-bonsai.awsapps.com/start`
- SSO region: `eu-central-1`
- SSO registration scopes: `sso:account:access` (default)
- CLI default region: `eu-central-1`
- CLI output format: `json` (recommended)
- Profile name: `bonsai-prod` or `bonsai-dev`
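Once the wizard finishes, you can sanity-check what it wrote. The profile it generates in ~/.aws/config should look roughly like the sketch below; the account ID and role name are placeholders consistent with the examples later in this runbook, and your values will differ.
# Inspect the generated profile
cat ~/.aws/config
# Expected (roughly):
[sso-session bonsai]
sso_start_url = https://tofu-bonsai.awsapps.com/start
sso_region = eu-central-1
sso_registration_scopes = sso:account:access
[profile bonsai-prod]
sso_session = bonsai
sso_account_id = 123456789012
sso_role_name = full_access
region = eu-central-1
output = json
If your SSO token later expires, refresh it with aws sso login --profile bonsai-prod.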
Step 2: Select Role
You’ll be prompted to choose a role:
- `full_access` - Access to both dev and production
- `bonsai_developers` - Access to dev only
- `Billing` - Billing access only
Step 3: Verify Configuration
# Test your profile
aws sts get-caller-identity --profile bonsai-prod
# Expected output:
{
"UserId": "XXXXXXXXXXXXXXXX:your-email@gotofu.com",
"Account": "123456789012",
"Arn": "arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_XXX_YYYYY/your-email@gotofu.com"
}
Step 4: Configure kubectl
# For production
aws eks update-kubeconfig \
--region eu-central-1 \
--name bonsai-app-eks-cluster-prod \
--profile bonsai-prod
# For dev
aws eks update-kubeconfig \
--region eu-central-1 \
--name bonsai-app-eks-cluster-dev \
--profile bonsai-dev
Step 5: Test kubectl Access
# List pods
kubectl get pods
# If you see a list of running pods, you're ready to go!
Switching Between Environments #
# View available contexts
kubectl config get-contexts
# Switch to prod
kubectl config use-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-prod
# Switch to dev
kubectl config use-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-dev
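The default context names are long cluster ARNs; if you switch often, kubectl can rename them to something shorter. The short names below are just a convenience suggestion, not part of the standard setup.
# Optional: give the contexts shorter names
kubectl config rename-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-prod bonsai-prod
kubectl config rename-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-dev bonsai-dev
# Switching then becomes
kubectl config use-context bonsai-prod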
Common kubectl Commands #
Viewing Resources #
List All Pods #
# All pods in default namespace
kubectl get pods
# All pods with more details
kubectl get pods -o wide
# All pods in all namespaces
kubectl get pods --all-namespaces
# Watch pods in real-time
kubectl get pods --watch
Filter Pods by Service #
# BonsAPI pods
kubectl get pods -l app=bonsapi
# Webapp pods
kubectl get pods -l app=webapp
# Invoice processing pods
kubectl get pods -l app=bonsai-invoice
# Knowledge service pods
kubectl get pods -l app=bonsai-knowledge
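Label selectors can be combined with field selectors when you only care about pods in a particular state; the examples below reuse the app=bonsapi label from above.
# Only BonsAPI pods that are not Running
kubectl get pods -l app=bonsapi --field-selector=status.phase!=Running
# Pods for several services at once (set-based selector)
kubectl get pods -l 'app in (bonsapi,webapp)'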
Check Pod Status #
# Describe pod (detailed information)
kubectl describe pod <pod-name>
# Get pod logs
kubectl logs <pod-name>
# Get logs from previous container (if pod crashed)
kubectl logs <pod-name> --previous
# Follow logs in real-time
kubectl logs -f <pod-name>
# Get logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>
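When logs are large, limiting the window and adding timestamps makes patterns easier to spot.
# Limit log output to a recent window, with timestamps
kubectl logs <pod-name> --since=1h --tail=200 --timestamps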
Deployments and Services #
View Deployments #
# List all deployments
kubectl get deployments
# Describe deployment
kubectl describe deployment <deployment-name>
# View deployment rollout status
kubectl rollout status deployment/<deployment-name>
# View rollout history
kubectl rollout history deployment/<deployment-name>
View Services #
# List all services
kubectl get services
# Describe service
kubectl describe service <service-name>
# Test service connectivity
kubectl get endpoints <service-name>
View ConfigMaps and Secrets #
# List ConfigMaps
kubectl get configmaps
# View ConfigMap contents
kubectl describe configmap <configmap-name>
# List secrets
kubectl get secrets
# Decode secret (careful with sensitive data!)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d
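To dump every key of a secret at once (again, mind where this output ends up), a go-template works since kubectl templates ship a base64decode function.
# Decode all keys in a secret
kubectl get secret <secret-name> -o go-template='{{range $k, $v := .data}}{{$k}}={{$v | base64decode}}{{"\n"}}{{end}}'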
Troubleshooting Common Issues #
Pod Not Starting (Pending State) #
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
bonsapi-xyz123 0/1 Pending 0 5m
Investigation:
# Check pod events
kubectl describe pod bonsapi-xyz123
# Common issues:
# - Insufficient CPU/memory
# - Image pull errors
# - Node selector constraints
# - Volume mounting issues
Common Causes & Solutions:
- Insufficient Resources
  - Events: Warning FailedScheduling (for example, insufficient cpu/memory or "pod has unbound immediate PersistentVolumeClaims")
  - Solution: Scale up the node group, reduce resource requests, or fix the unbound PersistentVolumeClaim (see the capacity check below)
- Image Pull Errors
  - Events: Warning Failed: Failed to pull image "bonsapi:latest"
  - Solution: Check ECR permissions, verify the image exists
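To confirm whether the cluster is genuinely short on capacity, check how much of each node is already used and allocated.
# Current node usage and allocations
kubectl top nodes
kubectl describe nodes | grep -A 7 "Allocated resources"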
Pod Crashing (CrashLoopBackOff) #
Symptoms:
NAME READY STATUS RESTARTS AGE
bonsapi-xyz123 0/1 CrashLoopBackOff 5 10m
Investigation:
# View current logs
kubectl logs bonsapi-xyz123
# View logs from previous crash
kubectl logs bonsapi-xyz123 --previous
# Check events
kubectl describe pod bonsapi-xyz123
Common Causes & Solutions:
- Application Error on Startup
  - Check logs for error messages
  - Verify environment variables are set correctly
  - Check database connectivity
- Liveness/Readiness Probe Failures
  - kubectl describe pod bonsapi-xyz123 | grep -A 10 "Liveness\|Readiness"
  - Solution: Adjust probe timing or fix the endpoint
- OOMKilled (Out of Memory)
  - State: Terminated, Reason: OOMKilled
  - Solution: Increase memory limits or fix the memory leak
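The last termination reason and exit code can also be pulled directly, which is handy when the describe output is noisy.
# Why did the container last die?
kubectl get pod bonsapi-xyz123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get pod bonsapi-xyz123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'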
High Restart Count #
Symptoms:
NAME READY STATUS RESTARTS AGE
bonsapi-xyz123 1/1 Running 15 2h
Investigation:
# View restart events
kubectl describe pod bonsapi-xyz123 | grep -A 5 "State\|Last State"
# Check logs for patterns
kubectl logs bonsapi-xyz123 --previous | tail -100
# Check resource usage
kubectl top pod bonsapi-xyz123
Common Causes:
- Memory leaks causing OOM
- Unhandled exceptions
- Health check failures
- Database connection issues
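To see at a glance which pods restart the most across the namespace, sort by restart count.
# Pods sorted by restart count (highest last)
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'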
Service Not Reachable #
Symptoms: API requests failing, 502/503 errors
Investigation:
# Check service endpoints
kubectl get endpoints bonsapi-service
# Should show pod IPs:
NAME ENDPOINTS AGE
bonsapi-service 10.0.1.5:8080,10.0.1.6:8080 24h
# If empty, pods aren't matching service selector
kubectl describe service bonsapi-service
kubectl get pods --show-labels
Solutions:
- No Endpoints
  - Verify pod labels match the service selector
  - Check pod readiness probes
- Ingress Issues
  - kubectl get ingress
  - kubectl describe ingress bonsapi-ingress
  - Check:
    - ALB health checks
    - Certificate status
    - Backend service configuration
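For the no-endpoints case, a quick way to spot a selector/label mismatch is to print the selector the service actually uses and compare it with the pod labels (bonsapi-service and app=bonsapi are the same examples used above).
# Selector the service is using
kubectl get service bonsapi-service -o jsonpath='{.spec.selector}'
# Labels on the pods you expect it to match
kubectl get pods -l app=bonsapi --show-labels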
Slow Performance or High Latency #
Investigation:
# Check resource usage
kubectl top pods
# Sample output:
NAME CPU(cores) MEMORY(bytes)
bonsapi-xyz123 950m 1800Mi
# Check resource limits
kubectl describe pod bonsapi-xyz123 | grep -A 5 "Limits\|Requests"
Common Causes:
- CPU Throttling
  - Pod using 100% of its CPU limit
  - Solution: Increase CPU limits or optimize the code
- Memory Pressure
  - Pod near its memory limit
  - Solution: Increase memory limits or optimize usage
- Network Issues
  - Check service mesh/network policies
  - Test connectivity between services
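For the resource-related causes, compare live usage against the configured requests and limits without scrolling through describe output.
# Per-container usage vs configured resources
kubectl top pod bonsapi-xyz123 --containers
kubectl get pod bonsapi-xyz123 -o jsonpath='{.spec.containers[0].resources}'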
Advanced Debugging #
Executing Commands in Pods #
# Get shell access to pod
kubectl exec -it <pod-name> -- /bin/bash
# Or sh if bash isn't available
kubectl exec -it <pod-name> -- /bin/sh
# Run single command
kubectl exec <pod-name> -- ls /app
# For multi-container pods, specify container
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash
Common Debugging Commands Inside Pods #
# Check environment variables
env | grep -i database
# Test network connectivity
nc -zv database 5432
nc -zv rabbitmq 5672
# Check disk usage
df -h
# Check running processes
ps aux
# View application logs
tail -f /var/log/app.log
# Test DNS resolution
nslookup rabbitmq
dig database
Port Forwarding #
Forward pod port to your local machine:
# Forward pod port 8080 to localhost:8080
kubectl port-forward pod/<pod-name> 8080:8080
# Forward service port
kubectl port-forward service/bonsapi-service 8080:8080
# Now access at http://localhost:8080
Copying Files To/From Pods #
# Copy file from pod to local
kubectl cp <pod-name>:/path/to/file ./local-file
# Copy file from local to pod
kubectl cp ./local-file <pod-name>:/path/to/destination
# For multi-container pods
kubectl cp <pod-name>:/path/to/file ./local-file -c <container-name>
Viewing Events #
# All events in namespace
kubectl get events --sort-by='.lastTimestamp'
# Filter by pod
kubectl get events --field-selector involvedObject.name=<pod-name>
# Watch events in real-time
kubectl get events --watch
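Filtering to warnings only cuts the noise considerably.
# Only warning events, newest last
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'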
Working with Deployments #
Scaling Deployments #
# Scale deployment to 3 replicas
kubectl scale deployment bonsapi-deployment --replicas=3
# Check scaling status
kubectl get deployment bonsapi-deployment
Rolling Restart #
# Restart all pods in deployment
kubectl rollout restart deployment/bonsapi-deployment
# Watch rollout status
kubectl rollout status deployment/bonsapi-deployment
Rollback Deployment #
# View rollout history
kubectl rollout history deployment/bonsapi-deployment
# Rollback to previous version
kubectl rollout undo deployment/bonsapi-deployment
# Rollback to specific revision
kubectl rollout undo deployment/bonsapi-deployment --to-revision=3
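Before rolling back, you can inspect exactly what a given revision contained; the revision number below is just an example.
# Show the pod template recorded for revision 3
kubectl rollout history deployment/bonsapi-deployment --revision=3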
Checking Resource Usage #
Node Resources #
# View node resource usage
kubectl top nodes
# Describe node for detailed info
kubectl describe node <node-name>
# View node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
Pod Resources #
# Resource usage for all pods
kubectl top pods
# Resource usage for specific pod
kubectl top pod <pod-name>
# Sort by memory usage
kubectl top pods --sort-by=memory
# Sort by CPU usage
kubectl top pods --sort-by=cpu
Monitoring Deployments #
KEDA Autoscaling #
BonsAI uses KEDA for RabbitMQ-based autoscaling.
# View KEDA scaled objects
kubectl get scaledobjects
# Describe scaled object
kubectl describe scaledobject bonsai-invoice-scaledobject
# Check KEDA operator logs
kubectl logs -n default -l app=keda-operator
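KEDA scales the deployment through an HPA it generates for each ScaledObject, so if scaling looks stuck it can help to inspect that HPA directly. The label selector and keda-hpa- name prefix below assume KEDA v2's defaults; adjust to whatever your installation actually creates.
# HPA generated by KEDA for the scaled object (KEDA v2 default label/name)
kubectl get hpa -l scaledobject.keda.sh/name=bonsai-invoice-scaledobject
kubectl describe hpa keda-hpa-bonsai-invoice-scaledobject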
Horizontal Pod Autoscaler #
# View HPA status
kubectl get hpa
# Describe HPA
kubectl describe hpa <hpa-name>
Troubleshooting Secrets and ConfigMaps #
Checking Secret Values #
# List all secrets
kubectl get secrets
# View secret (base64 encoded)
kubectl get secret bonsai-secret -o yaml
# Decode specific key
kubectl get secret bonsai-secret -o jsonpath='{.data.DATABASE_URL}' | base64 -d
WARNING: Be careful with production secrets. Don’t log them or share them insecurely.
Verifying External Secrets #
BonsAI uses External Secrets Operator to sync from Doppler.
# Check external secret status
kubectl get externalsecrets
# Describe external secret
kubectl describe externalsecret bonsai-external-secret
# Check secret store
kubectl describe secretstore doppler-secret-store
# View External Secrets operator logs
kubectl logs -n default -l app.kubernetes.io/name=external-secrets
Forcing Secret Refresh #
# Delete external secret to force refresh
kubectl delete externalsecret bonsai-external-secret
# Recreate (this will sync from Doppler)
kubectl apply -f deployment/resources/manifests/external_secret.yaml
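If the installed External Secrets Operator version supports it, annotating the resource with force-sync triggers a refresh in place, which is gentler than the delete/recreate above.
# Ask the operator to re-sync without deleting the ExternalSecret (version-dependent)
kubectl annotate externalsecret bonsai-external-secret force-sync=$(date +%s) --overwrite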
Checking Logs Across Multiple Pods #
Using Labels #
# Logs from all BonsAPI pods
kubectl logs -l app=bonsapi --tail=100
# Follow logs from all pods
kubectl logs -l app=bonsapi -f
# Logs from all pods in last 10 minutes
kubectl logs -l app=bonsapi --since=10m
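When tailing several pods at once, the --prefix flag labels each line with the pod it came from, which makes interleaved output much easier to read.
# Prefix each line with the pod name
kubectl logs -l app=bonsapi --tail=100 --prefix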
Using stern (if installed) #
Stern provides better multi-pod log viewing:
# Install stern
brew install stern # macOS
# or download from: https://github.com/stern/stern
# View logs from all bonsapi pods
stern bonsapi
# Filter by namespace and labels
stern -n default -l app=bonsapi
Networking Debugging #
DNS Issues #
# Run DNS debugging pod
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
nslookup bonsapi-service
nslookup rabbitmq
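If names fail to resolve from inside the cluster, check CoreDNS itself; on EKS the CoreDNS pods carry the k8s-app=kube-dns label.
# Is CoreDNS healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50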
Network Connectivity #
# Run network debugging pod
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
# Inside the pod:
curl http://bonsapi-service:8080/health
nc -zv rabbitmq 5672
Checking Service Mesh #
# View service endpoints
kubectl get endpoints
# Check network policies
kubectl get networkpolicies
Emergency Procedures #
Kill Stuck Pod #
# Graceful deletion (30 second grace period)
kubectl delete pod <pod-name>
# Force delete (immediate)
kubectl delete pod <pod-name> --grace-period=0 --force
Drain Node for Maintenance #
# Cordon node (prevent new pods)
kubectl cordon <node-name>
# Drain node (evict pods gracefully)
kubectl drain <node-name> --ignore-daemonsets
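# If the drain stalls on pods that use emptyDir volumes, allow their eviction
# (their emptyDir contents are lost)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data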
# Uncordon node (allow pods again)
kubectl uncordon <node-name>
Emergency Scale Down #
# Scale to 0 replicas (stops all pods)
kubectl scale deployment <deployment-name> --replicas=0
# Scale back up
kubectl scale deployment <deployment-name> --replicas=3
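Before scaling to zero, record the current replica count so you know what to scale back up to afterwards.
# Note the current replica count first
kubectl get deployment <deployment-name> -o jsonpath='{.spec.replicas}'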
Best Practices #
Safe Operations #
- Always verify the environment (dev/prod) before running commands
- Use the `--dry-run=client` flag to preview changes
- Check resource status before and after changes
- Keep a terminal log of all operations during incidents
Monitoring #
- Use `kubectl get events` to understand cluster state
- Check resource usage regularly with `kubectl top`
- Monitor pod restarts as an early warning signal
- Set up alerts for high restart counts
Security #
- Never log or share secret values
- Use RBAC appropriately (read-only vs admin)
- Audit production kubectl operations
- Rotate credentials regularly
Efficiency #
- Use aliases for common commands
- Learn kubectl auto-completion
- Use labels effectively for filtering
- Save complex queries as scripts
Useful kubectl Aliases #
Add to your ~/.bashrc or ~/.zshrc:
alias k='kubectl'
alias kg='kubectl get'
alias kd='kubectl describe'
alias kl='kubectl logs'
alias kx='kubectl exec -it'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
See Also #
- Kubernetes Setup - Initial EKS configuration
- Deployment Monitoring - Tracking deployments
- Secrets Management - Using Doppler and External Secrets
- Monitoring & Alerting - Datadog and CloudWatch
- RabbitMQ Management - Queue debugging
- Incident Response - Coordinating incident resolution