RabbitMQ Management #

This runbook covers monitoring, troubleshooting, and managing RabbitMQ message queues for asynchronous job processing in BonsAI.

When to Use This Runbook #

  • Queue backlog alerts firing
  • Messages not being processed
  • Consumer service failures
  • Dead letter queue investigations
  • Performance issues with message processing
  • Queue monitoring and capacity planning

Overview #

BonsAI uses Amazon MQ (managed RabbitMQ) for asynchronous message processing across multiple services:

Queue                  Producer   Consumer                 Purpose
invoice.processing     BonsAPI    bonsai-invoice           Invoice document processing
knowledge.processing   BonsAPI    bonsai-knowledge         Knowledge extraction
document.conversion    BonsAPI    bonsai-doc-convert       Document format conversion
notifications          Multiple   bonsai-notification      User notifications
accounting.sync        BonsAPI    bonsai-accounting-sync   Accounting integration sync

Accessing RabbitMQ #

Production Access #

Prerequisites:

  • AWS SSO configured
  • VPN access (if required)

Step 1: Get RabbitMQ Credentials

# Get connection details from Doppler
doppler secrets get RABBITMQ_HOST RABBITMQ_PORT RABBITMQ_USER RABBITMQ_PASSWORD \
  --project bonsai --config prod --plain

Step 2: Access Management Console

The RabbitMQ Management Console is available at:

https://<RABBITMQ_HOST>:15671

Login with credentials from Doppler.
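
Before troubleshooting further, it can be worth confirming the credentials work outside the browser. A minimal check via the management HTTP API, assuming the host and port shown above and the Doppler variables exported in your shell:

# Should return the authenticated user and their tags.
curl -sf -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" \
  "https://$RABBITMQ_HOST:15671/api/whoami"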

Local/Development Access #

# Start local development environment
mise run dev

# RabbitMQ Management Console:
# http://localhost:15672
# Username: guest
# Password: guest
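
For CLI access to the local broker, the management plugin also ships the rabbitmqadmin tool (downloadable from http://localhost:15672/cli/rabbitmqadmin); a quick sketch:

# List local queues with their message and consumer counts.
rabbitmqadmin -H localhost -P 15672 -u guest -p guest \
  list queues name messages consumers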

RabbitMQ Management Console #

Dashboard Overview #

The dashboard shows:

  • Total connections - Number of active client connections
  • Total channels - Number of channels across connections
  • Total queues - Number of queues in the system
  • Message rates - Publish/deliver/ack rates
  • Queue health - Ready messages, unacked messages, total

Key Sections #

Section       Purpose
Queues        View all queues, depths, and rates
Exchanges     View message routing configuration
Connections   Active client connections
Channels      Communication channels
Admin         User management, policies, limits

Monitoring Queues #

Checking Queue Depth #

Via Management Console:

  1. Navigate to Queues tab
  2. View metrics for each queue:
    • Ready - Messages waiting to be consumed
    • Unacked - Messages delivered but not acknowledged
    • Total - Ready + Unacked
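
Via the management HTTP API (useful for scripting; assumes the Doppler variables from the access section):

# Ready/unacked/consumer counts for every queue.
curl -sf -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" \
  "https://$RABBITMQ_HOST:15671/api/queues" |
  jq -r '.[] | "\(.name)\tready=\(.messages_ready)\tunacked=\(.messages_unacknowledged)\tconsumers=\(.consumers)"'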

Via kubectl (for KEDA monitoring):

# Check KEDA ScaledObject status
kubectl get scaledobject

# Describe scaled object for specific queue
kubectl describe scaledobject bonsai-invoice-scaledobject

# Check current replica count (scales based on queue depth)
kubectl get deployment bonsai-invoice-deployment

Queue Health Indicators #

Healthy Queue:

  • Ready messages: Low (< 100)
  • Processing rate: Stable
  • Consumer count: > 0
  • Message rate: Publish ≈ Deliver

Unhealthy Queue:

  • Ready messages: High and growing (> 1000)
  • Processing rate: Slow or zero
  • Consumer count: 0 or low
  • Message rate: Publish » Deliver
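
To turn these indicators into a quick check, the same API listing can be filtered for queues that breach the thresholds above (the thresholds here are the suggested ones, not hard limits):

# Flag queues with a large backlog or no consumers.
curl -sf -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" \
  "https://$RABBITMQ_HOST:15671/api/queues" |
  jq -r '.[] | select(.messages_ready > 1000 or .consumers == 0) |
         "\(.name): ready=\(.messages_ready), consumers=\(.consumers)"'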

Investigating Queue Backlogs #

Step 1: Identify the Problem #

# Check queue metrics via Management Console
# OR use Datadog dashboard for RabbitMQ queues

Common Scenarios:

  1. High Ready Count

    • Messages accumulating
    • Consumers not keeping up
  2. High Unacked Count

    • Messages delivered but not processed
    • Consumer crashed before ack
    • Long processing times
  3. Zero Consumers

    • Consumer pod crashed
    • Deployment failed
    • Service not started

Step 2: Check Consumer Status #

# Check consumer pods
kubectl get pods -l app=bonsai-invoice

# Check pod logs
kubectl logs -l app=bonsai-invoice --tail=100

# Check for errors
kubectl logs -l app=bonsai-invoice | grep -i error

Step 3: Check Consumer Performance #

# Check resource usage
kubectl top pod -l app=bonsai-invoice

# Check if pods are being throttled
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"

Step 4: Check Message Processing #

Via Management Console:

  1. Go to Queues tab
  2. Click queue name
  3. View Message rates graph
  4. Check Get messages to inspect message content

Via CLI (local development):

# Requires rabbitmqadmin tool
rabbitmqadmin get queue=invoice.processing count=5

Scaling Consumers #

BonsAI uses KEDA (Kubernetes Event-Driven Autoscaling) to automatically scale consumers based on queue depth.

Check KEDA Configuration #

# View KEDA scaled objects
kubectl get scaledobjects

# Describe specific scaled object
kubectl describe scaledobject bonsai-invoice-scaledobject

# Check scaling configuration
kubectl get scaledobject bonsai-invoice-scaledobject -o yaml

Typical KEDA Configuration:

triggers:
  - type: rabbitmq
    metadata:
      queueName: invoice.processing
      queueLength: "50"   # Target messages per replica (the HPA scales to keep the average near this)

minReplicaCount: 1        # Minimum pods
maxReplicaCount: 10       # Maximum pods
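
As a rough example, with queueLength set to "50", a backlog of ~400 ready messages drives the underlying HPA toward 400 / 50 = 8 replicas, bounded by minReplicaCount and maxReplicaCount (and subject to HPA stabilization windows).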

Manual Scaling #

If KEDA autoscaling isn’t keeping up:

# Manually scale deployment
kubectl scale deployment bonsai-invoice-deployment --replicas=5

# Verify scaling
kubectl get deployment bonsai-invoice-deployment
kubectl get pods -l app=bonsai-invoice

Important: KEDA's HPA will usually reconcile the replica count back to what the queue depth dictates on its next evaluation, so manual scaling is only a short-term measure; to hold a fixed count, pause autoscaling on the ScaledObject first (sketch below).
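
If you genuinely need to hold a fixed replica count, KEDA (v2.x) supports pausing autoscaling with an annotation on the ScaledObject; a sketch using the object name from above:

# Pin the workload at 5 replicas and stop KEDA from scaling it.
kubectl annotate scaledobject bonsai-invoice-scaledobject \
  autoscaling.keda.sh/paused-replicas="5" --overwrite

# Remove the annotation to hand control back to KEDA.
kubectl annotate scaledobject bonsai-invoice-scaledobject \
  autoscaling.keda.sh/paused-replicas-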

Troubleshooting Consumer Issues #

Consumer Not Processing Messages #

Symptoms: Queue depth growing, consumer pods running but not processing

Investigation:

  1. Check consumer logs

    kubectl logs -l app=bonsai-invoice --tail=100
    
  2. Look for common errors:

    • Database connection failures
    • RabbitMQ connection errors
    • Message parsing errors
    • Resource exhaustion
  3. Test consumer connectivity

    # Exec into consumer pod
    kubectl exec -it <pod-name> -- /bin/bash
    
    # Test RabbitMQ connection
    nc -zv <rabbitmq-host> 5672
    
    # Test database connection
    nc -zv <database-host> 5432
    

Common Solutions:

  1. RabbitMQ connection issue

    • Verify credentials in secrets
    • Check network connectivity
    • Restart consumer pods
  2. Database connection issue

    • Check database availability
    • Verify connection pool limits
    • Scale down consumers if overwhelming DB
  3. Message format issue

    • Check message schema changes
    • Review recent code deployments
    • Inspect failing messages in dead letter queue

Consumer Crashing #

Symptoms: Pods restarting frequently, high restart count

Investigation:

# Check pod status
kubectl get pods -l app=bonsai-invoice

# View previous logs (from crashed container)
kubectl logs <pod-name> --previous

# Check for OOMKilled
kubectl describe pod <pod-name> | grep -A 5 "State\|Last State"

Common Causes:

  1. Out of Memory (OOMKilled)

    • Increase memory limits
    • Fix memory leaks in code
    • Optimize message processing
  2. Unhandled Exception

    • Fix bug in consumer code
    • Add error handling
    • Deploy hotfix
  3. Resource Exhaustion

    • Increase resource limits
    • Optimize resource usage
    • Scale horizontally

Slow Message Processing #

Symptoms: Messages processed slowly, queue depth growing slowly

Investigation:

# Check resource usage
kubectl top pod -l app=bonsai-invoice

# Check processing time in logs
kubectl logs -l app=bonsai-invoice | grep "processing time"

# Check database query performance
# See Database Access runbook

Solutions:

  1. Optimize processing logic

    • Profile slow code paths
    • Optimize database queries
    • Reduce external API calls
  2. Increase resources

    • CPU limits (if throttled)
    • Memory limits (if swapping)
  3. Scale horizontally

    • Add more consumer pods
    • Distribute load

Managing Messages #

Inspecting Messages #

Via Management Console:

  1. Navigate to Queues tab
  2. Click queue name
  3. Scroll to Get messages
  4. Set count and requeue policy
  5. Click Get Message(s)

Message Details:

  • Payload (JSON)
  • Properties (timestamp, message ID)
  • Headers (custom metadata)
  • Routing key

Purging Queue #

WARNING: This deletes all messages in the queue!

Via Management Console:

  1. Navigate to Queues tab
  2. Click queue name
  3. Scroll to Purge Messages
  4. Click Purge Messages button
  5. Confirm action

Use Cases:

  • Clearing test messages in dev
  • Removing invalid messages after bug fix
  • Emergency queue reset (with approval)
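
If you need to script a purge (normally only against dev), both the CLI and the management HTTP API support it. A sketch against the default vhost, using the local dev credentials and the production Doppler variables respectively:

# Local/dev: purge via rabbitmqadmin.
rabbitmqadmin -H localhost -u guest -p guest purge queue name=invoice.processing

# Production: purge via the management HTTP API (irreversible).
curl -sf -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" -X DELETE \
  "https://$RABBITMQ_HOST:15671/api/queues/%2F/invoice.processing/contents"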

Republishing Failed Messages #

Via Management Console:

  1. Get message from dead letter queue
  2. Copy message payload
  3. Go to Exchanges tab
  4. Select appropriate exchange
  5. Click Publish message
  6. Paste payload and configure routing
  7. Click Publish message

Via Code:

For bulk republishing, write a script using the RabbitMQ client library.
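
A minimal sketch of such a script using the management HTTP API and jq rather than a client library. It assumes the default vhost, the queue names used in this runbook, and that messages should go back through the default exchange; adapt and add error handling before running it against production:

#!/usr/bin/env bash
# Drain a DLQ and republish each payload to the default exchange,
# routed back to the original queue.
set -euo pipefail
API="https://$RABBITMQ_HOST:15671/api"
AUTH="$RABBITMQ_USER:$RABBITMQ_PASSWORD"
VHOST="%2F"                         # default vhost "/" URL-encoded
DLQ="invoice.processing.dlq"
TARGET="invoice.processing"

while :; do
  # Pop one message (acked, so it is removed from the DLQ).
  msg=$(curl -sf -u "$AUTH" -H 'content-type: application/json' \
    -d '{"count":1,"ackmode":"ack_requeue_false","encoding":"auto"}' \
    "$API/queues/$VHOST/$DLQ/get")
  [ "$msg" = "[]" ] && break        # DLQ drained

  # Republish to the default exchange (amq.default) with the original
  # queue name as the routing key.
  echo "$msg" | jq --arg rk "$TARGET" \
    '{properties: {}, routing_key: $rk, payload: .[0].payload, payload_encoding: .[0].payload_encoding}' |
    curl -sf -u "$AUTH" -H 'content-type: application/json' -d @- \
      "$API/exchanges/$VHOST/amq.default/publish" > /dev/null
done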

Dead Letter Queues #

Checking Dead Letter Queue #

Messages that fail processing may end up in dead letter queues (DLQs).

Find DLQ:

Original queue: invoice.processing
DLQ: invoice.processing.dlq

Via Management Console:

  1. Navigate to Queues tab
  2. Look for queues ending in .dlq
  3. Click queue to view messages

Investigating Failed Messages #

  1. Get message from DLQ

  2. Check message headers for the failure reason (a CLI sketch follows this list):

    • x-first-death-reason - Why the message was first dead-lettered (e.g. rejected, expired)
    • x-first-death-queue - Queue where the message was first dead-lettered
    • x-death - Array of death records; each entry carries a count of delivery/retry attempts
  3. Review consumer logs for the failure timestamp

  4. Identify root cause:

    • Message format issue
    • Business logic error
    • External service failure
    • Resource issue
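
The headers are easiest to read through the management HTTP API; a minimal sketch against the DLQ from the example above (it requeues the message so nothing is lost while you investigate):

# Peek at the first DLQ message's headers without consuming it.
curl -sf -u "$RABBITMQ_USER:$RABBITMQ_PASSWORD" \
  -H 'content-type: application/json' \
  -d '{"count":1,"ackmode":"ack_requeue_true","encoding":"auto"}' \
  "https://$RABBITMQ_HOST:15671/api/queues/%2F/invoice.processing.dlq/get" |
  jq '.[0].properties.headers'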

Reprocessing DLQ Messages #

After fixing the root cause:

  1. Verify fix is deployed
  2. Test with single message first
  3. Move messages back to original queue:
    • Manually republish via console
    • Or use shovel/federation plugins
    • Or write script to republish

Performance Optimization #

Queue Performance Metrics #

Monitor these metrics in Datadog:

  • Message rate - Messages/second published and delivered
  • Consumer utilization - Percentage of time consumers are active
  • Prefetch count - Number of unacked messages per consumer
  • Processing time - Time from publish to ack

Optimizing Consumer Performance #

  1. Adjust Prefetch Count

    • Higher prefetch = Better throughput
    • Lower prefetch = Better load distribution
    • Typical: 10-50 messages
  2. Optimize Message Processing

    • Batch database operations
    • Cache frequently accessed data
    • Use async I/O where possible
  3. Right-Size Resources

    • CPU: Ensure not throttled
    • Memory: Enough headroom for processing
    • Network: Low latency to RabbitMQ and DB
  4. Horizontal Scaling

    • More consumers = More throughput
    • Balance against resource costs
    • Monitor diminishing returns

Queue Configuration #

Best Practices:

  • Durable queues - Survive broker restart
  • Message persistence - Messages written to disk
  • TTL (Time to Live) - Expire old messages
  • Max length - Limit queue depth
  • Dead letter exchange - Handle failures gracefully
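
Most of these settings can be applied with a policy instead of per-queue arguments. A hedged sketch (the policy name, TTL, and limits here are illustrative, and it assumes a dead-letter exchange named dlx already exists):

# 24 h message TTL, 100k message cap, and a DLX for invoice.processing.
rabbitmqadmin -H "$RABBITMQ_HOST" -P 15671 --ssl \
  -u "$RABBITMQ_USER" -p "$RABBITMQ_PASSWORD" \
  declare policy name=invoice-processing-limits \
  pattern='^invoice\.processing$' apply-to=queues \
  definition='{"message-ttl":86400000,"max-length":100000,"dead-letter-exchange":"dlx"}'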

Monitoring and Alerts #

Key Metrics to Monitor #

  1. Queue Depth

    • Alert: > 1000 messages
    • Critical: > 5000 messages
  2. Consumer Count

    • Alert: 0 consumers for > 5 minutes
    • Warning: Fewer than expected consumers
  3. Processing Rate

    • Alert: Publish rate » Deliver rate
    • Warning: Processing slower than usual
  4. Message Age

    • Alert: Messages older than 1 hour
    • Critical: Messages older than 4 hours

Setting Up Alerts in Datadog #

# Example Datadog monitor query
avg(last_5m):avg:rabbitmq.queue.messages{queue:invoice.processing} > 1000

Alert Configuration:

  • Warning: > 500 messages
  • Critical: > 1000 messages
  • Recovery: < 100 messages

Troubleshooting RabbitMQ Broker #

Broker Connection Issues #

Symptoms: Consumers unable to connect

Investigation:

# Check Amazon MQ broker status
aws mq describe-broker --broker-id <broker-id>

# Check security groups
aws ec2 describe-security-groups --group-ids <sg-id>

# Test connectivity from consumer pod
kubectl exec -it <pod-name> -- nc -zv <rabbitmq-host> 5672

Solutions:

  • Verify broker is running
  • Check security group rules
  • Verify credentials
  • Check VPC networking

High Memory Usage #

Symptoms: Broker memory alarm, publishers blocked

Investigation:

Via Management Console:

  1. Check Overview page
  2. Look for memory alarms
  3. Check queue memory usage

Solutions:

  • Purge unnecessary messages
  • Increase consumer throughput
  • Add more broker nodes (scale up)
  • Configure queue max length

Best Practices #

Message Design #

  • Keep messages small (< 128 KB)
  • Use message IDs for idempotency
  • Include timestamps
  • Use structured format (JSON)
  • Don’t include large binary data

Consumer Design #

  • Implement retry logic with exponential backoff
  • Handle failures gracefully
  • Acknowledge messages only after successful processing
  • Log processing errors with context
  • Monitor processing time

Operations #

  • Regular monitoring of queue metrics
  • Set up appropriate alerts
  • Scale consumers proactively
  • Test consumer failures in dev
  • Document message schemas
  • Plan for capacity growth

Emergency Procedures #

Queue Completely Stuck #

  1. Stop all producers (if possible)
  2. Check consumer status
  3. Restart consumers if needed
  4. Scale up consumers if backlog is large
  5. Monitor progress
  6. Resume producers when caught up

Broker Unresponsive #

  1. Check broker status in AWS Console
  2. Review CloudWatch metrics
  3. Check for memory/disk issues
  4. Contact AWS Support if broker issue
  5. Consider failover if multi-AZ setup

Data Loss Risk #

If messages are at risk:

  1. Enable persistence if not already
  2. Take broker snapshot (if possible)
  3. Pause producers to stop new messages
  4. Scale up consumers to drain queue quickly
  5. Document message counts before/after

See Also #