RabbitMQ Management #
This runbook covers monitoring, troubleshooting, and managing RabbitMQ message queues for asynchronous job processing in BonsAI.
When to Use This Runbook #
- Queue backlog alerts firing
- Messages not being processed
- Consumer service failures
- Dead letter queue investigations
- Performance issues with message processing
- Queue monitoring and capacity planning
Overview #
BonsAI uses Amazon MQ (managed RabbitMQ) for asynchronous message processing across multiple services:
| Queue | Producer | Consumer | Purpose |
|---|---|---|---|
| invoice.processing | BonsAPI | bonsai-invoice | Invoice document processing |
| knowledge.processing | BonsAPI | bonsai-knowledge | Knowledge extraction |
| document.conversion | BonsAPI | bonsai-doc-convert | Document format conversion |
| notifications | Multiple | bonsai-notification | User notifications |
| accounting.sync | BonsAPI | bonsai-accounting-sync | Accounting integration sync |
Accessing RabbitMQ #
Production Access #
Prerequisites:
- AWS SSO configured
- VPN access (if required)
Step 1: Get RabbitMQ Credentials
# Get connection details from Doppler
doppler secrets get RABBITMQ_HOST RABBITMQ_PORT RABBITMQ_USER RABBITMQ_PASSWORD \
--project bonsai --config prod --plain
Step 2: Access Management Console
The RabbitMQ Management Console is available at:
https://<RABBITMQ_HOST>:15671
Login with credentials from Doppler.
Local/Development Access #
# Start local development environment
mise run dev
# RabbitMQ Management Console:
# http://localhost:15672
# Username: guest
# Password: guest
RabbitMQ Management Console #
Dashboard Overview #
The dashboard shows:
- Total connections - Number of active client connections
- Total channels - Number of channels across connections
- Total queues - Number of queues in the system
- Message rates - Publish/deliver/ack rates
- Queue health - Ready messages, unacked messages, total
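The same numbers are exposed by the management HTTP API, which is handy for quick checks from a terminal. A minimal sketch, assuming the host and credentials from Doppler above (port 15671 in production, 15672 locally) and that jq is installed:
# Snapshot of object totals and message rates from the management API
curl -s -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" \
  "https://<RABBITMQ_HOST>:15671/api/overview" | jq '{totals: .object_totals, rates: .message_stats}'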
Key Sections #
| Section | Purpose |
|---|---|
| Queues | View all queues, depths, and rates |
| Exchanges | View message routing configuration |
| Connections | Active client connections |
| Channels | Communication channels |
| Admin | User management, policies, limits |
Monitoring Queues #
Checking Queue Depth #
Via Management Console:
- Navigate to Queues tab
- View metrics for each queue:
- Ready - Messages waiting to be consumed
- Unacked - Messages delivered but not acknowledged
- Total - Ready + Unacked
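Via CLI (optional): if the rabbitmqadmin tool is configured against the broker, queue depths can also be listed from a terminal. A sketch with assumed connection flags; host and credentials are placeholders:
# List depth and consumer count for every queue
rabbitmqadmin --host=<RABBITMQ_HOST> --port=15671 --ssl \
  --username=<user> --password=<pass> \
  list queues name messages_ready messages_unacknowledged messages consumers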
Via kubectl (for KEDA monitoring):
# Check KEDA ScaledObject status
kubectl get scaledobject
# Describe scaled object for specific queue
kubectl describe scaledobject bonsai-invoice-scaledobject
# Check current replica count (scales based on queue depth)
kubectl get deployment bonsai-invoice-deployment
Queue Health Indicators #
Healthy Queue:
- Ready messages: Low (< 100)
- Processing rate: Stable
- Consumer count: > 0
- Message rate: Publish ≈ Deliver
Unhealthy Queue:
- Ready messages: High and growing (> 1000)
- Processing rate: Slow or zero
- Consumer count: 0 or low
- Message rate: Publish » Deliver
Investigating Queue Backlogs #
Step 1: Identify the Problem #
# Check queue metrics via Management Console
# OR use Datadog dashboard for RabbitMQ queues
Common Scenarios:
- High Ready Count
  - Messages accumulating
  - Consumers not keeping up
- High Unacked Count
  - Messages delivered but not processed
  - Consumer crashed before ack
  - Long processing times
- Zero Consumers
  - Consumer pod crashed
  - Deployment failed
  - Service not started
Step 2: Check Consumer Status #
# Check consumer pods
kubectl get pods -l app=bonsai-invoice
# Check pod logs
kubectl logs -l app=bonsai-invoice --tail=100
# Check for errors
kubectl logs -l app=bonsai-invoice | grep -i error
Step 3: Check Consumer Performance #
# Check resource usage
kubectl top pod -l app=bonsai-invoice
# Check if pods are being throttled
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"
Step 4: Check Message Processing #
Via Management Console:
- Go to Queues tab
- Click queue name
- View Message rates graph
- Check Get messages to inspect message content
Via CLI (local development):
# Requires rabbitmqadmin tool
rabbitmqadmin get queue=invoice.processing count=5
Scaling Consumers #
BonsAI uses KEDA (Kubernetes Event-Driven Autoscaling) to automatically scale consumers based on queue depth.
Check KEDA Configuration #
# View KEDA scaled objects
kubectl get scaledobjects
# Describe specific scaled object
kubectl describe scaledobject bonsai-invoice-scaledobject
# Check scaling configuration
kubectl get scaledobject bonsai-invoice-scaledobject -o yaml
Typical KEDA Configuration:
triggers:
  - type: rabbitmq
    metadata:
      queueName: invoice.processing
      queueLength: "50"   # Scale up when > 50 messages
minReplicaCount: 1        # Minimum pods
maxReplicaCount: 10       # Maximum pods
Manual Scaling #
If KEDA autoscaling isn’t keeping up:
# Manually scale deployment
kubectl scale deployment bonsai-invoice-deployment --replicas=5
# Verify scaling
kubectl get deployment bonsai-invoice-deployment
kubectl get pods -l app=bonsai-invoice
Important: KEDA manages the deployment through an HPA, so a manual scale is temporary and may be reverted at the next autoscaler reconciliation.
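If the manual replica count needs to stick while you investigate, newer KEDA releases can pause autoscaling with an annotation on the ScaledObject. A sketch; verify against the KEDA version in the cluster, and note the replica count is an example:
# Pin the deployment at 5 replicas and stop KEDA from scaling it
kubectl annotate scaledobject bonsai-invoice-scaledobject \
  autoscaling.keda.sh/paused-replicas="5" --overwrite
# Resume autoscaling by removing the annotation
kubectl annotate scaledobject bonsai-invoice-scaledobject \
  autoscaling.keda.sh/paused-replicas-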
Troubleshooting Consumer Issues #
Consumer Not Processing Messages #
Symptoms: Queue depth growing, consumer pods running but not processing
Investigation:
- Check consumer logs
  kubectl logs -l app=bonsai-invoice --tail=100
- Look for common errors:
  - Database connection failures
  - RabbitMQ connection errors
  - Message parsing errors
  - Resource exhaustion
- Test consumer connectivity
  # Exec into consumer pod
  kubectl exec -it <pod-name> -- /bin/bash
  # Test RabbitMQ connection
  nc -zv <rabbitmq-host> 5672
  # Test database connection
  nc -zv <database-host> 5432
Common Solutions:
- RabbitMQ connection issue
  - Verify credentials in secrets
  - Check network connectivity
  - Restart consumer pods
- Database connection issue
  - Check database availability
  - Verify connection pool limits
  - Scale down consumers if overwhelming DB
- Message format issue
  - Check message schema changes
  - Review recent code deployments
  - Inspect failing messages in dead letter queue
Consumer Crashing #
Symptoms: Pods restarting frequently, high restart count
Investigation:
# Check pod status
kubectl get pods -l app=bonsai-invoice
# View previous logs (from crashed container)
kubectl logs <pod-name> --previous
# Check for OOMKilled
kubectl describe pod <pod-name> | grep -A 5 "State\|Last State"
Common Causes:
- Out of Memory (OOMKilled)
  - Increase memory limits (see the sketch after this list)
  - Fix memory leaks in code
  - Optimize message processing
- Unhandled Exception
  - Fix bug in consumer code
  - Add error handling
  - Deploy hotfix
- Resource Exhaustion
  - Increase resource limits
  - Optimize resource usage
  - Scale horizontally
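For an immediate mitigation of OOMKilled pods, the limits can be bumped in place while a proper fix lands in the manifests. A sketch with example values; the deployment name matches the examples above:
# Raise memory request/limit on the consumer deployment (values are examples;
# update the Helm chart/manifests afterwards so the change persists)
kubectl set resources deployment bonsai-invoice-deployment \
  --requests=memory=512Mi --limits=memory=1Gi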
Slow Message Processing #
Symptoms: Messages processed slowly, queue depth growing slowly
Investigation:
# Check resource usage
kubectl top pod -l app=bonsai-invoice
# Check processing time in logs
kubectl logs -l app=bonsai-invoice | grep "processing time"
# Check database query performance
# See Database Access runbook
Solutions:
- Optimize processing logic
  - Profile slow code paths
  - Optimize database queries
  - Reduce external API calls
- Increase resources
  - CPU limits (if throttled)
  - Memory limits (if swapping)
- Scale horizontally
  - Add more consumer pods
  - Distribute load
Managing Messages #
Inspecting Messages #
Via Management Console:
- Navigate to Queues tab
- Click queue name
- Scroll to Get messages
- Set count and requeue policy
- Click Get Message(s)
Message Details:
- Payload (JSON)
- Properties (timestamp, message ID)
- Headers (custom metadata)
- Routing key
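The same inspection can be done over the management HTTP API without removing messages from the queue. A sketch, assuming the default vhost (%2f) and the production host/credentials; ack_requeue_true puts the messages back:
# Peek at 5 messages and requeue them afterwards
curl -s -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" \
  -X POST "https://<RABBITMQ_HOST>:15671/api/queues/%2f/invoice.processing/get" \
  -H "Content-Type: application/json" \
  -d '{"count": 5, "ackmode": "ack_requeue_true", "encoding": "auto"}' | jq .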
Purging Queue #
WARNING: This deletes all messages in the queue!
Via Management Console:
- Navigate to Queues tab
- Click queue name
- Scroll to Purge Messages
- Click Purge Messages button
- Confirm action
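Via CLI (optional): where rabbitmqadmin is configured, the same purge can be run from a terminal; the queue name is an example:
# Irreversibly remove all messages from the queue
rabbitmqadmin purge queue=invoice.processing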
Use Cases:
- Clearing test messages in dev
- Removing invalid messages after bug fix
- Emergency queue reset (with approval)
Republishing Failed Messages #
Via Management Console:
- Get message from dead letter queue
- Copy message payload
- Go to Exchanges tab
- Select appropriate exchange
- Click Publish message
- Paste payload and configure routing
- Click Publish message
Via Code:
For bulk republishing, write a script using the RabbitMQ client library.
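Alternatively, a minimal shell sketch using rabbitmqadmin and jq instead of a client library; the DLQ, exchange, and routing key names are illustrative and must match the real topology, and the get flags may vary slightly between rabbitmqadmin versions:
#!/usr/bin/env bash
# Drain up to $COUNT messages from a DLQ and republish them for reprocessing.
set -euo pipefail

DLQ="invoice.processing.dlq"          # source dead letter queue (example)
EXCHANGE="bonsai.direct"              # hypothetical exchange name
ROUTING_KEY="invoice.processing"      # hypothetical routing key
COUNT=100

for _ in $(seq "$COUNT"); do
  # Pop one message; ack_requeue_false removes it from the DLQ
  payload=$(rabbitmqadmin -f raw_json get queue="$DLQ" ackmode=ack_requeue_false count=1 \
    | jq -r '.[0].payload // empty')
  [ -z "$payload" ] && break          # stop when the DLQ is empty
  rabbitmqadmin publish exchange="$EXCHANGE" routing_key="$ROUTING_KEY" payload="$payload"
done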
Dead Letter Queues #
Checking Dead Letter Queue #
Messages that fail processing may end up in dead letter queues (DLQs).
Find DLQ:
Original queue: invoice.processing
DLQ: invoice.processing.dlq
Via Management Console:
- Navigate to Queues tab
- Look for queues ending in .dlq
- Click the queue to view messages
Investigating Failed Messages #
- Get the message from the DLQ
- Check message headers for the failure reason:
  - x-first-death-reason - Why it was rejected
  - x-first-death-queue - Original queue
  - x-death - Retry history; each entry includes a count of delivery attempts
- Review consumer logs around the failure timestamp
- Identify the root cause:
  - Message format issue
  - Business logic error
  - External service failure
  - Resource issue
Reprocessing DLQ Messages #
After fixing the root cause:
- Verify fix is deployed
- Test with single message first
- Move messages back to the original queue:
  - Manually republish via the console
  - Use a dynamic shovel or federation (see the shovel sketch below)
  - Write a script to republish in bulk
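For larger DLQ backlogs, a dynamic shovel moves everything without per-message scripting. A sketch using the management HTTP API, assuming the shovel plugin is enabled on the broker; vhost, queue names, and URIs are examples:
# Create a one-shot shovel that drains the DLQ back into the original queue
# and deletes itself once the current backlog has been moved
curl -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" -X PUT \
  "https://<RABBITMQ_HOST>:15671/api/parameters/shovel/%2f/requeue-invoice-dlq" \
  -H "Content-Type: application/json" \
  -d '{"value": {
        "src-protocol": "amqp091", "src-uri": "amqp://", "src-queue": "invoice.processing.dlq",
        "src-delete-after": "queue-length",
        "dest-protocol": "amqp091", "dest-uri": "amqp://", "dest-queue": "invoice.processing"
      }}'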
Performance Optimization #
Queue Performance Metrics #
Monitor these metrics in Datadog:
- Message rate - Messages/second published and delivered
- Consumer utilization - Percentage of time consumers are active
- Prefetch count - Number of unacked messages per consumer
- Processing time - Time from publish to ack
Optimizing Consumer Performance #
- Adjust Prefetch Count
  - Higher prefetch = Better throughput
  - Lower prefetch = Better load distribution
  - Typical: 10-50 messages
- Optimize Message Processing
  - Batch database operations
  - Cache frequently accessed data
  - Use async I/O where possible
- Right-Size Resources
  - CPU: Ensure not throttled
  - Memory: Enough headroom for processing
  - Network: Low latency to RabbitMQ and DB
- Horizontal Scaling
  - More consumers = More throughput
  - Balance against resource costs
  - Monitor diminishing returns
Queue Configuration #
Best Practices:
- Durable queues - Survive broker restart
- Message persistence - Messages written to disk
- TTL (Time to Live) - Expire old messages
- Max length - Limit queue depth
- Dead letter exchange - Handle failures gracefully
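Several of these settings can be applied through a policy rather than per-queue arguments. A sketch using rabbitmqadmin; the policy name, pattern, and values are illustrative:
# Apply a 24h TTL, a max length, and a dead-letter exchange to invoice.processing
rabbitmqadmin declare policy name=invoice-limits pattern='^invoice\.processing$' \
  apply-to=queues \
  definition='{"message-ttl": 86400000, "max-length": 100000, "dead-letter-exchange": "dlx"}'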
Monitoring and Alerts #
Key Metrics to Monitor #
- Queue Depth
  - Alert: > 1000 messages
  - Critical: > 5000 messages
- Consumer Count
  - Alert: 0 consumers for > 5 minutes
  - Warning: Fewer than expected consumers
- Processing Rate
  - Alert: Publish rate » Deliver rate
  - Warning: Processing slower than usual
- Message Age
  - Alert: Messages older than 1 hour
  - Critical: Messages older than 4 hours
Setting Up Alerts in Datadog #
# Example Datadog monitor query
avg(last_5m):avg:rabbitmq.queue.messages{queue:invoice.processing} > 1000
Alert Configuration:
- Warning: > 500 messages
- Critical: > 1000 messages
- Recovery: < 100 messages
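A similar monitor can cover the zero-consumer case, using the consumer-count metric from the same Datadog RabbitMQ integration (metric name assumed to match the integration in use):
# Example Datadog monitor query: no consumers on the queue for 5 minutes
min(last_5m):avg:rabbitmq.queue.consumers{queue:invoice.processing} < 1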
Troubleshooting RabbitMQ Broker #
Broker Connection Issues #
Symptoms: Consumers unable to connect
Investigation:
# Check Amazon MQ broker status
aws mq describe-broker --broker-id <broker-id>
# Check security groups
aws ec2 describe-security-groups --group-ids <sg-id>
# Test connectivity from consumer pod
kubectl exec -it <pod-name> -- nc -zv <rabbitmq-host> 5672
Solutions:
- Verify broker is running
- Check security group rules
- Verify credentials
- Check VPC networking
High Memory Usage #
Symptoms: Broker memory alarm, publishers blocked
Investigation:
Via Management Console:
- Check Overview page
- Look for memory alarms
- Check queue memory usage
Solutions:
- Purge unnecessary messages
- Increase consumer throughput
- Add more broker nodes (scale up)
- Configure queue max length
Best Practices #
Message Design #
- Keep messages small (< 128 KB)
- Use message IDs for idempotency
- Include timestamps
- Use structured format (JSON)
- Don’t include large binary data
Consumer Design #
- Implement retry logic with exponential backoff
- Handle failures gracefully
- Acknowledge messages only after successful processing
- Log processing errors with context
- Monitor processing time
Operations #
- Regular monitoring of queue metrics
- Set up appropriate alerts
- Scale consumers proactively
- Test consumer failures in dev
- Document message schemas
- Plan for capacity growth
Emergency Procedures #
Queue Completely Stuck #
- Stop all producers (if possible)
- Check consumer status
- Restart consumers if needed
- Scale up consumers if backlog is large
- Monitor progress
- Resume producers when caught up
Broker Unresponsive #
- Check broker status in AWS Console
- Review CloudWatch metrics
- Check for memory/disk issues
- Contact AWS Support if broker issue
- Consider failover if multi-AZ setup
Data Loss Risk #
If messages are at risk:
- Enable persistence if not already
- Take broker snapshot (if possible)
- Pause producers to stop new messages
- Scale up consumers to drain queue quickly
- Document message counts before/after
See Also #
- Kubernetes Debugging - Consumer pod troubleshooting
- Monitoring & Alerting - Queue metrics in Datadog
- Deployment Monitoring - Consumer deployment
- Incident Response - Handling queue incidents
- Database Access - Consumer database issues