Kubernetes Debugging #
This runbook covers troubleshooting and debugging applications running on AWS EKS (Elastic Kubernetes Service) using kubectl and related tools.
When to Use This Runbook #
- Pod crashes or restart loops
- Service unavailability or connection issues
- Deployment failures or rollout problems
- Resource exhaustion (CPU, memory)
- Container startup failures
- Investigating production incidents
Prerequisites #
Required Tools #
- AWS CLI (v2.19.4+)
- kubectl (compatible with EKS version)
- Configured AWS SSO profile
AWS SSO Setup #
Step 1: Configure AWS CLI with SSO
aws configure sso
Provide the following information:
- SSO session name: `bonsai` (or any name you prefer)
- SSO start URL: `https://tofu-bonsai.awsapps.com/start`
- SSO region: `eu-central-1`
- SSO registration scopes: `sso:account:access` (default)
- CLI default region: `eu-central-1`
- CLI output format: `json` (recommended)
- Profile name: `bonsai-prod` or `bonsai-dev`
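Once the wizard finishes, you can sanity-check what it wrote. The profile it generates in ~/.aws/config should look roughly like the sketch below; the account ID and role name are placeholders consistent with the examples later in this runbook, and your values will differ.
# Inspect the generated profile
cat ~/.aws/config
# Expected (roughly):
[sso-session bonsai]
sso_start_url = https://tofu-bonsai.awsapps.com/start
sso_region = eu-central-1
sso_registration_scopes = sso:account:access
[profile bonsai-prod]
sso_session = bonsai
sso_account_id = 123456789012
sso_role_name = full_access
region = eu-central-1
output = json
If your SSO token later expires, refresh it with aws sso login --profile bonsai-prod.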
Step 2: Select Role
You’ll be prompted to choose a role:
- `full_access` - Access to both dev and production
- `bonsai_developers` - Access to dev only
- `Billing` - Billing access only
Step 3: Verify Configuration
# Test your profile
aws sts get-caller-identity --profile bonsai-prod
# Expected output:
{
"UserId": "XXXXXXXXXXXXXXXX:your-email@gotofu.com",
"Account": "123456789012",
"Arn": "arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_XXX_YYYYY/your-email@gotofu.com"
}
Step 4: Configure kubectl
# For production
aws eks update-kubeconfig \
--region eu-central-1 \
--name bonsai-app-eks-cluster-prod \
--profile bonsai-prod
# For dev
aws eks update-kubeconfig \
--region eu-central-1 \
--name bonsai-app-eks-cluster-dev \
--profile bonsai-dev
Step 5: Test kubectl Access
# List pods
kubectl get pods
# If you see a list of running pods, you're ready to go!
Switching Between Environments #
# View available contexts
kubectl config get-contexts
# Switch to prod
kubectl config use-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-prod
# Switch to dev
kubectl config use-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-dev
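The default context names are long cluster ARNs; if you switch often, kubectl can rename them to something shorter. The short names below are just a convenience suggestion, not part of the standard setup.
# Optional: give the contexts shorter names
kubectl config rename-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-prod bonsai-prod
kubectl config rename-context arn:aws:eks:eu-central-1:<account-id>:cluster/bonsai-app-eks-cluster-dev bonsai-dev
# Switching then becomes
kubectl config use-context bonsai-prod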
Common kubectl Commands #
Viewing Resources #
List All Pods #
# All pods in default namespace
kubectl get pods
# All pods with more details
kubectl get pods -o wide
# All pods in all namespaces
kubectl get pods --all-namespaces
# Watch pods in real-time
kubectl get pods --watch
Filter Pods by Service #
# BonsAPI pods
kubectl get pods -l app=bonsapi
# Webapp pods
kubectl get pods -l app=webapp
# Invoice processing pods
kubectl get pods -l app=bonsai-invoice
# Knowledge service pods
kubectl get pods -l app=bonsai-knowledge
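Label selectors can be combined with field selectors when you only care about pods in a particular state; the examples below reuse the app=bonsapi label from above.
# Only BonsAPI pods that are not Running
kubectl get pods -l app=bonsapi --field-selector=status.phase!=Running
# Pods for several services at once (set-based selector)
kubectl get pods -l 'app in (bonsapi,webapp)'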
Check Pod Status #
# Describe pod (detailed information)
kubectl describe pod <pod-name>
# Get pod logs
kubectl logs <pod-name>
# Get logs from previous container (if pod crashed)
kubectl logs <pod-name> --previous
# Follow logs in real-time
kubectl logs -f <pod-name>
# Get logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>
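When logs are large, limiting the window and adding timestamps makes patterns easier to spot.
# Limit log output to a recent window, with timestamps
kubectl logs <pod-name> --since=1h --tail=200 --timestamps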
Deployments and Services #
View Deployments #
# List all deployments
kubectl get deployments
# Describe deployment
kubectl describe deployment <deployment-name>
# View deployment rollout status
kubectl rollout status deployment/<deployment-name>
# View rollout history
kubectl rollout history deployment/<deployment-name>
View Services #
# List all services
kubectl get services
# Describe service
kubectl describe service <service-name>
# Test service connectivity
kubectl get endpoints <service-name>
View ConfigMaps and Secrets #
# List ConfigMaps
kubectl get configmaps
# View ConfigMap contents
kubectl describe configmap <configmap-name>
# List secrets
kubectl get secrets
# Decode secret (careful with sensitive data!)
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d
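To dump every key of a secret at once (again, mind where this output ends up), a go-template works since kubectl templates ship a base64decode function.
# Decode all keys in a secret
kubectl get secret <secret-name> -o go-template='{{range $k, $v := .data}}{{$k}}={{$v | base64decode}}{{"\n"}}{{end}}'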
Troubleshooting Common Issues #
Pod Not Starting (Pending State) #
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
bonsapi-xyz123 0/1 Pending 0 5m
Investigation:
# Check pod events
kubectl describe pod bonsapi-xyz123
# Common issues:
# - Insufficient CPU/memory
# - Image pull errors
# - Node selector constraints
# - Volume mounting issues
Common Causes & Solutions:
- Insufficient Resources
  - Events: Warning FailedScheduling (for example, insufficient cpu/memory or "pod has unbound immediate PersistentVolumeClaims")
  - Solution: Scale up the node group, reduce resource requests, or fix the unbound PersistentVolumeClaim (see the capacity check below)
- Image Pull Errors
  - Events: Warning Failed: Failed to pull image "bonsapi:latest"
  - Solution: Check ECR permissions, verify the image exists
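To confirm whether the cluster is genuinely short on capacity, check how much of each node is already used and allocated.
# Current node usage and allocations
kubectl top nodes
kubectl describe nodes | grep -A 7 "Allocated resources"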
Pod Crashing (CrashLoopBackOff) #
Symptoms:
NAME READY STATUS RESTARTS AGE
bonsapi-xyz123 0/1 CrashLoopBackOff 5 10m
Investigation:
# View current logs
kubectl logs bonsapi-xyz123
# View logs from previous crash
kubectl logs bonsapi-xyz123 --previous
# Check events
kubectl describe pod bonsapi-xyz123
Common Causes & Solutions:
- Application Error on Startup
  - Check logs for error messages
  - Verify environment variables are set correctly
  - Check database connectivity
- Liveness/Readiness Probe Failures
  - kubectl describe pod bonsapi-xyz123 | grep -A 10 "Liveness\|Readiness"
  - Solution: Adjust probe timing or fix the endpoint
- OOMKilled (Out of Memory)
  - State: Terminated, Reason: OOMKilled
  - Solution: Increase memory limits or fix the memory leak
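The last termination reason and exit code can also be pulled directly, which is handy when the describe output is noisy.
# Why did the container last die?
kubectl get pod bonsapi-xyz123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
kubectl get pod bonsapi-xyz123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'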
High Restart Count #
Symptoms:
NAME READY STATUS RESTARTS AGE
bonsapi-xyz123 1/1 Running 15 2h
Investigation:
# View restart events
kubectl describe pod bonsapi-xyz123 | grep -A 5 "State\|Last State"
# Check logs for patterns
kubectl logs bonsapi-xyz123 --previous | tail -100
# Check resource usage
kubectl top pod bonsapi-xyz123
Common Causes:
- Memory leaks causing OOM
- Unhandled exceptions
- Health check failures
- Database connection issues
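To see at a glance which pods restart the most across the namespace, sort by restart count.
# Pods sorted by restart count (highest last)
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'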
Service Not Reachable #
Symptoms: API requests failing, 502/503 errors
Investigation:
# Check service endpoints
kubectl get endpoints bonsapi-service
# Should show pod IPs:
NAME ENDPOINTS AGE
bonsapi-service 10.0.1.5:8080,10.0.1.6:8080 24h
# If empty, pods aren't matching service selector
kubectl describe service bonsapi-service
kubectl get pods --show-labels
Solutions:
- No Endpoints
  - Verify pod labels match the service selector
  - Check pod readiness probes
- Ingress Issues
  - kubectl get ingress
  - kubectl describe ingress bonsapi-ingress
  - Check:
    - ALB health checks
    - Certificate status
    - Backend service configuration
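For the no-endpoints case, a quick way to spot a selector/label mismatch is to print the selector the service actually uses and compare it with the pod labels (bonsapi-service and app=bonsapi are the same examples used above).
# Selector the service is using
kubectl get service bonsapi-service -o jsonpath='{.spec.selector}'
# Labels on the pods you expect it to match
kubectl get pods -l app=bonsapi --show-labels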
Slow Performance or High Latency #
Investigation:
# Check resource usage
kubectl top pods
# Sample output:
NAME CPU(cores) MEMORY(bytes)
bonsapi-xyz123 950m 1800Mi
# Check resource limits
kubectl describe pod bonsapi-xyz123 | grep -A 5 "Limits\|Requests"
Common Causes:
- CPU Throttling
  - Pod using 100% of its CPU limit
  - Solution: Increase CPU limits or optimize the code
- Memory Pressure
  - Pod near its memory limit
  - Solution: Increase memory limits or optimize usage
- Network Issues
  - Check service mesh/network policies
  - Test connectivity between services
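For the resource-related causes, compare live usage against the configured requests and limits without scrolling through describe output.
# Per-container usage vs configured resources
kubectl top pod bonsapi-xyz123 --containers
kubectl get pod bonsapi-xyz123 -o jsonpath='{.spec.containers[0].resources}'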
Advanced Debugging #
Executing Commands in Pods #
# Get shell access to pod
kubectl exec -it <pod-name> -- /bin/bash
# Or sh if bash isn't available
kubectl exec -it <pod-name> -- /bin/sh
# Run single command
kubectl exec <pod-name> -- ls /app
# For multi-container pods, specify container
kubectl exec -it <pod-name> -c <container-name> -- /bin/bash
Common Debugging Commands Inside Pods #
# Check environment variables
env | grep -i database
# Test network connectivity
nc -zv database 5432
nc -zv rabbitmq 5672
# Check disk usage
df -h
# Check running processes
ps aux
# View application logs
tail -f /var/log/app.log
# Test DNS resolution
nslookup rabbitmq
dig database
Port Forwarding #
Forward pod port to your local machine:
# Forward pod port 8080 to localhost:8080
kubectl port-forward pod/<pod-name> 8080:8080
# Forward service port
kubectl port-forward service/bonsapi-service 8080:8080
# Now access at http://localhost:8080
Copying Files To/From Pods #
# Copy file from pod to local
kubectl cp <pod-name>:/path/to/file ./local-file
# Copy file from local to pod
kubectl cp ./local-file <pod-name>:/path/to/destination
# For multi-container pods
kubectl cp <pod-name>:/path/to/file ./local-file -c <container-name>
Viewing Events #
# All events in namespace
kubectl get events --sort-by='.lastTimestamp'
# Filter by pod
kubectl get events --field-selector involvedObject.name=<pod-name>
# Watch events in real-time
kubectl get events --watch
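Filtering to warnings only cuts the noise considerably.
# Only warning events, newest last
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'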
Working with Deployments #
Scaling Deployments #
# Scale deployment to 3 replicas
kubectl scale deployment bonsapi-deployment --replicas=3
# Check scaling status
kubectl get deployment bonsapi-deployment
Rolling Restart #
# Restart all pods in deployment
kubectl rollout restart deployment/bonsapi-deployment
# Watch rollout status
kubectl rollout status deployment/bonsapi-deployment
Rollback Deployment #
# View rollout history
kubectl rollout history deployment/bonsapi-deployment
# Rollback to previous version
kubectl rollout undo deployment/bonsapi-deployment
# Rollback to specific revision
kubectl rollout undo deployment/bonsapi-deployment --to-revision=3
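Before rolling back, you can inspect exactly what a given revision contained; the revision number below is just an example.
# Show the pod template recorded for revision 3
kubectl rollout history deployment/bonsapi-deployment --revision=3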
Checking Resource Usage #
Node Resources #
# View node resource usage
kubectl top nodes
# Describe node for detailed info
kubectl describe node <node-name>
# View node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
Pod Resources #
# Resource usage for all pods
kubectl top pods
# Resource usage for specific pod
kubectl top pod <pod-name>
# Sort by memory usage
kubectl top pods --sort-by=memory
# Sort by CPU usage
kubectl top pods --sort-by=cpu
Monitoring Deployments #
KEDA Autoscaling #
BonsAI uses KEDA for RabbitMQ-based autoscaling.
# View KEDA scaled objects
kubectl get scaledobjects
# Describe scaled object
kubectl describe scaledobject bonsai-invoice-scaledobject
# Check KEDA operator logs
kubectl logs -n default -l app=keda-operator
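KEDA scales the deployment through an HPA it generates for each ScaledObject, so if scaling looks stuck it can help to inspect that HPA directly. The label selector and keda-hpa- name prefix below assume KEDA v2's defaults; adjust to whatever your installation actually creates.
# HPA generated by KEDA for the scaled object (KEDA v2 default label/name)
kubectl get hpa -l scaledobject.keda.sh/name=bonsai-invoice-scaledobject
kubectl describe hpa keda-hpa-bonsai-invoice-scaledobject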
Horizontal Pod Autoscaler #
# View HPA status
kubectl get hpa
# Describe HPA
kubectl describe hpa <hpa-name>
Troubleshooting Secrets and ConfigMaps #
Checking Secret Values #
# List all secrets
kubectl get secrets
# View secret (base64 encoded)
kubectl get secret bonsai-secret -o yaml
# Decode specific key
kubectl get secret bonsai-secret -o jsonpath='{.data.DATABASE_URL}' | base64 -d
WARNING: Be careful with production secrets. Don’t log them or share them insecurely.
Verifying External Secrets #
BonsAI uses External Secrets Operator to sync from Doppler.
# Check external secret status
kubectl get externalsecrets
# Describe external secret
kubectl describe externalsecret bonsai-external-secret
# Check secret store
kubectl describe secretstore doppler-secret-store
# View External Secrets operator logs
kubectl logs -n default -l app.kubernetes.io/name=external-secrets
Forcing Secret Refresh #
# Delete external secret to force refresh
kubectl delete externalsecret bonsai-external-secret
# Recreate (this will sync from Doppler)
kubectl apply -f deployment/resources/manifests/external_secret.yaml
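If the installed External Secrets Operator version supports it, annotating the resource with force-sync triggers a refresh in place, which is gentler than the delete/recreate above.
# Ask the operator to re-sync without deleting the ExternalSecret (version-dependent)
kubectl annotate externalsecret bonsai-external-secret force-sync=$(date +%s) --overwrite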
Checking Logs Across Multiple Pods #
Using Labels #
# Logs from all BonsAPI pods
kubectl logs -l app=bonsapi --tail=100
# Follow logs from all pods
kubectl logs -l app=bonsapi -f
# Logs from all pods in last 10 minutes
kubectl logs -l app=bonsapi --since=10m
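When tailing several pods at once, the --prefix flag labels each line with the pod it came from, which makes interleaved output much easier to read.
# Prefix each line with the pod name
kubectl logs -l app=bonsapi --tail=100 --prefix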
Using stern (if installed) #
Stern provides better multi-pod log viewing:
# Install stern
brew install stern # macOS
# or download from: https://github.com/stern/stern
# View logs from all bonsapi pods
stern bonsapi
# Filter by namespace and labels
stern -n default -l app=bonsapi
Networking Debugging #
DNS Issues #
# Run DNS debugging pod
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside the pod:
nslookup bonsapi-service
nslookup rabbitmq
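If names fail to resolve from inside the cluster, check CoreDNS itself; on EKS the CoreDNS pods carry the k8s-app=kube-dns label.
# Is CoreDNS healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50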
Network Connectivity #
# Run network debugging pod
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- bash
# Inside the pod:
curl http://bonsapi-service:8080/health
nc -zv rabbitmq 5672
Checking Service Mesh #
# View service endpoints
kubectl get endpoints
# Check network policies
kubectl get networkpolicies
Emergency Procedures #
Kill Stuck Pod #
# Graceful deletion (30 second grace period)
kubectl delete pod <pod-name>
# Force delete (immediate)
kubectl delete pod <pod-name> --grace-period=0 --force
Drain Node for Maintenance #
# Cordon node (prevent new pods)
kubectl cordon <node-name>
# Drain node (evict pods gracefully)
kubectl drain <node-name> --ignore-daemonsets
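# If the drain stalls on pods that use emptyDir volumes, allow their eviction
# (their emptyDir contents are lost)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data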
# Uncordon node (allow pods again)
kubectl uncordon <node-name>
Emergency Scale Down #
# Scale to 0 replicas (stops all pods)
kubectl scale deployment <deployment-name> --replicas=0
# Scale back up
kubectl scale deployment <deployment-name> --replicas=3
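Before scaling to zero, record the current replica count so you know what to scale back up to afterwards.
# Note the current replica count first
kubectl get deployment <deployment-name> -o jsonpath='{.spec.replicas}'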
Best Practices #
Safe Operations #
- Always verify the environment (dev/prod) before running commands
- Use the `--dry-run=client` flag to preview changes
- Check resource status before and after changes
- Keep a terminal log of all operations during incidents
Monitoring #
- Use `kubectl get events` to understand cluster state
- Check resource usage regularly with `kubectl top`
- Monitor pod restarts as an early warning signal
- Set up alerts for high restart counts
Security #
- Never log or share secret values
- Use RBAC appropriately (read-only vs admin)
- Audit production kubectl operations
- Rotate credentials regularly
Efficiency #
- Use aliases for common commands
- Learn kubectl auto-completion
- Use labels effectively for filtering
- Save complex queries as scripts
Useful kubectl Aliases #
Add to your ~/.bashrc or ~/.zshrc:
alias k='kubectl'
alias kg='kubectl get'
alias kd='kubectl describe'
alias kl='kubectl logs'
alias kx='kubectl exec -it'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
See Also #
- Kubernetes Setup - Initial EKS configuration
- Deployment Monitoring - Tracking deployments
- Secrets Management - Using Doppler and External Secrets
- Monitoring & Alerting - Datadog and CloudWatch
- RabbitMQ Management - Queue debugging
- Incident Response - Coordinating incident resolution