Log Management & Search #

This runbook covers how to find, analyze, and troubleshoot using logs across all BonsAI services.

When to Use This Runbook #

  • Investigating errors or exceptions
  • Tracing user requests across services (end-to-end with trace IDs)
  • Debugging production issues
  • Analyzing performance problems
  • Audit and compliance reviews
  • Following a request from frontend through backend to LLM

Log Architecture #

BonsAI logs are collected and stored in multiple locations:

Application Logs
    ↓
Container stdout/stderr
    ↓
Fluent Bit (log collector)
    ↓
┌─────────────┬─────────────┬─────────────┐
│  CloudWatch │   Datadog   │  S3 Archive │
│  (3 days)   │  (15 days)  │  (365 days) │
└─────────────┴─────────────┴─────────────┘
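
Before digging into a missing-logs report, it can help to confirm the collector layer itself is healthy. This is a sketch only: the namespace and label below are assumptions and should be adjusted to how Fluent Bit is actually deployed in this cluster.

# Check that the Fluent Bit pods are running (namespace and label are assumptions)
kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit -o wide

# Inspect the collector's own logs if one sink (CloudWatch, Datadog, S3) looks stale
kubectl logs -n logging -l app.kubernetes.io/name=fluent-bit --tail=100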

Log Locations #

Kubernetes Pods (Real-time) #

# View logs for a specific pod
kubectl logs <pod-name>

# View logs from all pods with a label
kubectl logs -l app=bonsapi --tail=100

# Follow logs in real-time
kubectl logs -f <pod-name>

# View logs from previous container (if crashed)
kubectl logs <pod-name> --previous

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>

CloudWatch Logs (Short-term) #

Log Groups:

  • /eks/prod/pods - Application logs from all pods
  • /eks/dev/pods - Development environment logs
  • /aws/eks/bonsai-app-eks-cluster-prod/cluster - EKS control plane logs

Retention: 3 days (production), then archived to S3

Access:

  1. AWS Console → CloudWatch → Log groups
  2. Select log group
  3. Filter by log stream (pod name)
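
The same log groups can also be tailed from a terminal with the AWS CLI (v2); the stream-name prefix in the last example is an assumption based on streams being named after pods.

# Tail production pod logs in real time
aws logs tail /eks/prod/pods --follow

# Only show matching lines from the last 30 minutes
aws logs tail /eks/prod/pods --since 30m --filter-pattern "ERROR"

# Narrow to log streams for a single pod
aws logs tail /eks/prod/pods --since 1h --log-stream-name-prefix bonsapi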

Datadog (Medium-term) #

Retention: 15 days with full search capabilities

Access:

  1. Go to Datadog → Logs
  2. Use search filters
  3. View logs by service, time, severity
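
For scripted access, the Datadog Logs Search API accepts the same query syntax as the UI. A minimal sketch, assuming DD_API_KEY and DD_APP_KEY are exported and the org lives on the datadoghq.com site:

curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "filter": {
          "query": "service:bonsapi status:error",
          "from": "now-15m",
          "to": "now"
        },
        "page": { "limit": 50 },
        "sort": "-timestamp"
      }'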

S3 Archive (Long-term) #

Retention: 365 days for compliance

Location:

  • s3://bonsai-logs-prod/application-log/
  • s3://bonsai-logs-dev/application-log/

Access:

# List recent log files
aws s3 ls s3://bonsai-logs-prod/application-log/ --recursive | tail -20

# Download specific log file
aws s3 cp s3://bonsai-logs-prod/application-log/2025/10/21/log-file.gz ./

# Uncompress and view
gunzip log-file.gz
less log-file

Searching Logs #

Using kubectl #

Basic log viewing:

# Last 100 lines
kubectl logs <pod-name> --tail=100

# Since timestamp
kubectl logs <pod-name> --since-time=2025-10-21T10:00:00Z

# Since duration
kubectl logs <pod-name> --since=1h

# All pods with label
kubectl logs -l app=bonsapi --tail=50

Filtering with grep:

# Find errors
kubectl logs <pod-name> | grep -i error

# Find specific user activity
kubectl logs <pod-name> | grep "user_id:123"

# Find by request ID
kubectl logs <pod-name> | grep "request_id:abc-123"

# Multiple filters
kubectl logs <pod-name> | grep -i error | grep "invoice"

Using CloudWatch Insights #

CloudWatch Logs Insights provides a purpose-built query language for searching and aggregating log data.

Access:

  1. AWS Console → CloudWatch → Logs → Insights
  2. Select log group: /eks/prod/pods
  3. Enter query
  4. Select time range
  5. Run query

Example Queries:

# Find all errors
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Find errors for specific service
fields @timestamp, @message, @logStream
| filter @logStream like /bonsapi/
| filter @message like /ERROR|FATAL/
| sort @timestamp desc

# Find by user ID
fields @timestamp, @message
| filter @message like /user_id:12345/
| sort @timestamp desc

# Aggregate errors by type
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message /ERROR: (?<error_type>.*?) -/
| stats count() as error_count by error_type
| sort error_count desc

# API response times
fields @timestamp, responseTime
| filter ispresent(responseTime)
| stats avg(responseTime), max(responseTime), min(responseTime) by bin(5m)

# Failed requests count
fields @timestamp
| filter status >= 500
| stats count() by bin(1m)
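
The same Insights queries can be run without the console via the AWS CLI; a sketch against the production log group:

# Kick off an Insights query over the last hour
QUERY_ID=$(aws logs start-query \
  --log-group-name /eks/prod/pods \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100' \
  --query 'queryId' --output text)

# Poll for results (the query runs asynchronously)
aws logs get-query-results --query-id "$QUERY_ID"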

Using Datadog #

Basic search:

  1. Navigate to Logs in Datadog
  2. Use search bar with filters:
# By service
service:bonsapi

# By status
status:error

# By time range (use picker)

# Combined filters
service:bonsapi status:error @http.status_code:500

# By user
@user.id:12345

# By request ID
@request_id:abc-123

Advanced search syntax:

# Exact match
@message:"database connection failed"

# Wildcard
@message:*connection*failed*

# Range
@http.response_time:[1000 TO 5000]

# Exists
@user.id:*

# Boolean
service:bonsapi AND status:error

# Exclude
service:bonsapi -status:info

Saved views:

Create saved views for common searches:

  • Error logs by service
  • High latency requests
  • Failed authentication attempts
  • Queue processing errors

Log Patterns and Formats #

BonsAPI (Rust) #

Format: [TIMESTAMP] [LEVEL] [MODULE] message
Example: [2025-10-21T10:15:30Z] [ERROR] [bonsapi::service::invoice] Failed to process invoice: Connection timeout

Key fields:

  • level - DEBUG, INFO, WARN, ERROR
  • module - Rust module path
  • request_id - Request correlation ID
  • user_id - User identifier

Webapp (Next.js) #

Format: JSON structured logs
Example:
{
  "timestamp": "2025-10-21T10:15:30.123Z",
  "level": "error",
  "service": "bonsai-webapp",
  "message": "Failed to fetch user data",
  "data": {
    "user_id": "123",
    "error": "Network timeout"
  }
}

Key fields:

  • level - debug, info, warn, error
  • service - bonsai-webapp
  • ddsource - nextjs
  • request_id - Request correlation ID
  • data - Additional context
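
Because these entries are line-delimited JSON, they can be filtered by field directly with jq; a sketch assuming the pods carry an app=bonsai-webapp label:

# Show only error-level entries and the fields that matter
kubectl logs -l app=bonsai-webapp --tail=500 \
  | jq -R 'fromjson? | select(.level == "error") | {timestamp, message, data}'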

Python Services (Invoice, Knowledge) #

Format: [TIMESTAMP] [LEVEL] [LOGGER] message
Example: [2025-10-21 10:15:30] [ERROR] [bonsai_invoice.jobs.document] Document processing failed: OCR timeout

Key fields:

  • level - DEBUG, INFO, WARNING, ERROR, CRITICAL
  • logger - Python logger name
  • document_id - Document identifier
  • job_id - Job identifier
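
The bracketed format makes it easy to count which loggers are producing errors; a sketch assuming the invoice pods carry the app=bonsai-invoice label used elsewhere in this runbook:

# Count ERROR lines per Python logger over the last hour
kubectl logs -l app=bonsai-invoice --since=1h \
  | grep '\[ERROR\]' \
  | awk -F'[][]' '{print $6}' \
  | sort | uniq -c | sort -rn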

Distributed Tracing with Trace IDs #

BonsAI implements end-to-end distributed tracing using trace IDs that flow from the frontend through all backend services to LLM calls.

How Trace IDs Work #

User Action (Frontend)
    ↓
Generate X-TOFU-TRACE-ID (UUID v4)
    ↓
API Request with X-TOFU-TRACE-ID header
    ↓
BonsAPI receives and logs trace_id
    ↓
RabbitMQ message includes trace_id
    ↓
Worker services (bonsai-invoice, etc.) log trace_id
    ↓
LLM calls tracked with same trace_id
    ↓
Datadog LLM Observability links everything

Trace ID Flow:

  1. Frontend (Webapp) - Generates trace ID and adds header

    // apps/webapp/src/shared/lib/api/axios-instance.ts
    import { v4 } from 'uuid'; // the uuid package provides the v4() generator
    const traceId = v4(); // UUID v4
    config.headers['X-TOFU-TRACE-ID'] = traceId;
    
  2. Backend (BonsAPI) - Extracts and logs trace ID

    // Rust services use bonsai-utils to get current trace
    let trace_id = current_trace_id();
    
  3. Workers (Python) - Includes trace ID in all operations

    # Python services use bonsai_utils.logging
    from bonsai_utils.logging import current_trace_id
    trace_id = current_trace_id()
    
  4. LLM Calls - Annotated with trace ID in Datadog

    # hinoki/utils/decorators/llm_obs/decorator.py
    LLMObs.annotate(input_data={"trace_id": trace_id})
    

Tracing End-to-End Requests #

Scenario: User uploads invoice → API processes → Worker extracts data → LLM analyzes

Step 1: Get the Trace ID

From frontend logs or API response headers:

# Example trace ID
X-TOFU-TRACE-ID: 550e8400-e29b-41d4-a716-446655440000
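
When reproducing an issue, a trace ID can also be injected up front so the whole flow can be followed with a known value. The host, path, and token below are placeholders, not real endpoints:

# Generate a trace ID first so it can be searched for immediately
TRACE_ID=$(uuidgen | tr 'A-Z' 'a-z')
echo "Using trace ID: $TRACE_ID"

# Issue the request with the trace ID header (replace host, path, and token)
curl -s "https://<api-host>/api/v1/<endpoint>" \
  -H "X-TOFU-TRACE-ID: ${TRACE_ID}" \
  -H "Authorization: Bearer <token>"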

Step 2: Search All Services with Trace ID

# Datadog (RECOMMENDED - searches all services)
@trace_id:550e8400-e29b-41d4-a716-446655440000

# kubectl (real-time logs)
kubectl logs -l app=bonsapi | grep "550e8400-e29b-41d4-a716-446655440000"
kubectl logs -l app=bonsai-invoice | grep "550e8400-e29b-41d4-a716-446655440000"

# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /550e8400-e29b-41d4-a716-446655440000/
| sort @timestamp asc

Step 3: View Complete Timeline

In Datadog:

  1. Search for trace ID: @trace_id:550e8400-e29b-41d4-a716-446655440000
  2. Sort by timestamp (ascending)
  3. See complete request flow:
    10:15:30.123 [bonsai-webapp] User clicked "Upload Invoice"
    10:15:30.234 [bonsapi] POST /api/v1/documents - trace_id: 550e8400...
    10:15:30.456 [bonsapi] Published to RabbitMQ queue: invoice.processing
    10:15:31.789 [bonsai-invoice] Received message from queue
    10:15:32.012 [bonsai-invoice] Starting document extraction
    10:15:35.678 [bonsai-invoice] Calling LLM for analysis
    10:15:38.901 [bonsai-invoice] LLM response received
    10:15:39.123 [bonsapi] Document processing completed
    

Step 4: Check LLM Observability

For LLM calls in the trace:

  1. Navigate to Datadog → LLM Observability
  2. Search by trace ID
  3. View LLM-specific metrics (see LLM Observability)

Common Log Searches #

Find all logs for a specific request using trace ID:

# Datadog (best option - searches everything)
@trace_id:550e8400-e29b-41d4-a716-446655440000

# kubectl (real-time)
kubectl logs -l app=bonsapi | grep "550e8400-e29b-41d4-a716-446655440000"

# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /550e8400-e29b-41d4-a716-446655440000/
| sort @timestamp asc

For older logs without trace IDs:

# kubectl
kubectl logs -l app=bonsapi | grep "request_id:abc-123"

# Datadog
@request_id:abc-123

# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /request_id:abc-123/
| sort @timestamp asc

Finding Errors #

All errors in last hour:

# kubectl (live)
kubectl logs -l app=bonsapi --since=1h | grep -i error

# Datadog (set the time picker to the past hour)
service:bonsapi status:error

# CloudWatch Insights (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR|FATAL/
| sort @timestamp desc

User Activity Logs #

Track specific user actions:

# kubectl
kubectl logs -l app=bonsapi | grep "user_id:12345"

# Datadog
@user.id:12345

# CloudWatch
fields @timestamp, @message
| filter @message like /user_id:12345/
| sort @timestamp asc

Document Processing Logs #

Find logs for specific document:

# kubectl
kubectl logs -l app=bonsai-invoice | grep "document_id:doc-123"

# Datadog
service:bonsai-invoice @document_id:doc-123

# CloudWatch
fields @timestamp, @message
| filter @logStream like /bonsai-invoice/
| filter @message like /document_id:doc-123/
| sort @timestamp asc

Database Query Logs #

Find slow database queries:

# Datadog
service:bonsapi @db.statement_duration:>1000

# CloudWatch
fields @timestamp, @message
| filter @message like /slow query/
| filter @message like /duration/
| sort @timestamp desc

Troubleshooting with Logs #

Investigating 500 Errors #

  1. Find the error logs

    kubectl logs -l app=bonsapi | grep "500\|ERROR" | tail -20
    
  2. Get the request ID from the error log

  3. Trace the full request

    kubectl logs -l app=bonsapi | grep "request_id:<id>"
    
  4. Check stack trace for root cause

  5. Verify database/external service logs (see the sketch below)
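
A quick way to cover step 5 is to sweep the neighboring services and recent cluster events around the same window; the worker label below is an assumption and should be swapped for whichever service the failing request touched:

# Errors and timeouts in a downstream worker over the last 30 minutes
kubectl logs -l app=bonsai-invoice --since=30m | grep -iE "error|timeout" | tail -20

# Recent cluster events (OOM kills, failed probes, restarts)
kubectl get events --sort-by=.lastTimestamp | tail -30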

Debugging Slow Performance #

  1. Find slow requests

    # Datadog
    @http.response_time:>2000 service:bonsapi
    
  2. Analyze patterns

    • Which endpoints are slow?
    • What time of day?
    • Specific users or all users?
  3. Check resource usage

    kubectl top pods
    
  4. Review database query performance

Authentication Issues #

  1. Find auth failures

    # Datadog
    service:bonsapi @message:*authentication*failed*
    
  2. Check for patterns

    • Specific users?
    • Specific auth method?
    • Timing correlation?
  3. Verify Clerk service status

  4. Check JWT validation logs (see the sketch below)
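
For step 4, a quick grep across the API pods usually narrows things down; a sketch that assumes auth failures surface as 401s or JWT errors in the log text:

# Auth-related log lines from the last hour
kubectl logs -l app=bonsapi --since=1h \
  | grep -iE "jwt|unauthorized|401|token (expired|invalid)" \
  | tail -50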

Log Management Best Practices #

What to Log #

DO log:

  • Request/response metadata (IDs, timestamps)
  • Errors and exceptions with context
  • Important business events (document processed, payment completed)
  • Performance metrics (query time, processing duration)
  • Security events (auth failures, permission denials)

DON’T log:

  • Passwords or secrets
  • Credit card numbers or PII
  • Large binary data
  • Excessive debug logs in production

Log Levels #

Use appropriate log levels:

  • DEBUG - Detailed development information (dev only)
  • INFO - General informational messages
  • WARN - Warning messages, potential issues
  • ERROR - Error messages, handled exceptions
  • FATAL/CRITICAL - Critical errors, service shutdown

Structured Logging #

Use structured logs (JSON) for easy parsing:

// Good
logger.info('User logged in', {
  user_id: '123',
  ip_address: '1.2.3.4',
  auth_method: 'oauth'
});

// Bad
logger.info(`User 123 logged in from 1.2.3.4 via oauth`);

Log Context #

Include correlation IDs for tracing:

// Rust example
tracing::info!(
    request_id = %request.id,
    user_id = %user.id,
    "Processing invoice"
);

Exporting Logs #

Download from CloudWatch #

# Using AWS CLI
aws logs filter-log-events \
  --log-group-name /eks/prod/pods \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --output json > logs.json

Download from S3 Archive #

# List files for specific date
aws s3 ls s3://bonsai-logs-prod/application-log/2025/10/21/

# Download specific log file
aws s3 cp s3://bonsai-logs-prod/application-log/2025/10/21/log-file.gz ./

# Download entire day
aws s3 sync s3://bonsai-logs-prod/application-log/2025/10/21/ ./logs/

Export from Datadog #

  1. Navigate to Logs in Datadog
  2. Apply filters for desired logs
  3. Click Export → Download as CSV or JSON
  4. Select time range and fields

Log Retention & Compliance #

Retention Policies #

Location      Retention        Purpose
Kubernetes    Real-time only   Active debugging
CloudWatch    3 days           Recent troubleshooting
Datadog       15 days          Search and analysis
S3 Archive    365 days         Compliance, auditing

Compliance Considerations #

  • GDPR - User data can be requested and deleted
  • SOC 2 - Audit logs must be retained
  • Access control - Logs may contain sensitive data
  • Data privacy - PII should be masked or encrypted

See Also #