Log Management & Search #

This runbook covers how to find, analyze, and troubleshoot using logs across all BonsAI services.

When to Use This Runbook #

  • Investigating errors or exceptions
  • Tracing user requests across services (end-to-end with trace IDs)
  • Debugging production issues
  • Analyzing performance problems
  • Audit and compliance reviews
  • Following a request from frontend through backend to LLM

Log Architecture #

BonsAI logs are collected and stored in multiple locations:

Application Logs
    ↓
Container stdout/stderr
    ↓
Fluent Bit (log collector)
    ↓
┌─────────────┬─────────────┬─────────────┐
│  CloudWatch │   Datadog   │  S3 Archive │
│  (3 days)   │  (15 days)  │  (365 days) │
└─────────────┴─────────────┴─────────────┘
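
Before digging into a missing-logs report, it can help to confirm the collector layer itself is healthy. This is a sketch only: the namespace and label below are assumptions and should be adjusted to how Fluent Bit is actually deployed in this cluster.

# Check that the Fluent Bit pods are running (namespace and label are assumptions)
kubectl get pods -n logging -l app.kubernetes.io/name=fluent-bit -o wide

# Inspect the collector's own logs if one sink (CloudWatch, Datadog, S3) looks stale
kubectl logs -n logging -l app.kubernetes.io/name=fluent-bit --tail=100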

Log Locations #

Kubernetes Pods (Real-time) #

# View logs for a specific pod
kubectl logs <pod-name>

# View logs from all pods with a label
kubectl logs -l app=bonsapi --tail=100

# Follow logs in real-time
kubectl logs -f <pod-name>

# View logs from previous container (if crashed)
kubectl logs <pod-name> --previous

# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>

CloudWatch Logs (Short-term) #

Log Groups:

  • /eks/prod/pods - Application logs from all pods
  • /eks/dev/pods - Development environment logs
  • /aws/eks/bonsai-app-eks-cluster-prod/cluster - EKS control plane logs

Retention: 3 days (production), then archived to S3

Access:

  1. AWS Console → CloudWatch → Log groups
  2. Select log group
  3. Filter by log stream (pod name)
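
The same log groups can also be tailed from a terminal with the AWS CLI (v2); the stream-name prefix in the last example is an assumption based on streams being named after pods.

# Tail production pod logs in real time
aws logs tail /eks/prod/pods --follow

# Only show matching lines from the last 30 minutes
aws logs tail /eks/prod/pods --since 30m --filter-pattern "ERROR"

# Narrow to log streams for a single pod
aws logs tail /eks/prod/pods --since 1h --log-stream-name-prefix bonsapi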

Datadog (Medium-term) #

Retention: 15 days with full search capabilities

Access:

  1. Go to Datadog → Logs
  2. Use search filters
  3. View logs by service, time, severity
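
For scripted access, the Datadog Logs Search API accepts the same query syntax as the UI. A minimal sketch, assuming DD_API_KEY and DD_APP_KEY are exported and the org lives on the datadoghq.com site:

curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
        "filter": {
          "query": "service:bonsapi status:error",
          "from": "now-15m",
          "to": "now"
        },
        "page": { "limit": 50 },
        "sort": "-timestamp"
      }'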

S3 Archive (Long-term) #

Retention: 365 days for compliance

Location:

  • s3://bonsai-logs-prod/application-log/
  • s3://bonsai-logs-dev/application-log/

Access:

# List recent log files
aws s3 ls s3://bonsai-logs-prod/application-log/ --recursive | tail -20

# Download specific log file
aws s3 cp s3://bonsai-logs-prod/application-log/2025/10/21/log-file.gz ./

# Uncompress and view
gunzip log-file.gz
less log-file

Searching Logs #

Using kubectl #

Basic log viewing:

# Last 100 lines
kubectl logs <pod-name> --tail=100

# Since timestamp
kubectl logs <pod-name> --since-time=2025-10-21T10:00:00Z

# Since duration
kubectl logs <pod-name> --since=1h

# All pods with label
kubectl logs -l app=bonsapi --tail=50

Filtering with grep:

# Find errors
kubectl logs <pod-name> | grep -i error

# Find specific user activity
kubectl logs <pod-name> | grep "user_id:123"

# Find by request ID
kubectl logs <pod-name> | grep "request_id:abc-123"

# Multiple filters
kubectl logs <pod-name> | grep -i error | grep "invoice"

Using CloudWatch Insights #

CloudWatch Logs Insights provides a purpose-built query language for searching and aggregating log data.

Access:

  1. AWS Console → CloudWatch → Logs → Insights
  2. Select log group: /eks/prod/pods
  3. Enter query
  4. Select time range
  5. Run query

Example Queries:

# Find all errors
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Find errors for specific service
fields @timestamp, @message, @logStream
| filter @logStream like /bonsapi/
| filter @message like /ERROR|FATAL/
| sort @timestamp desc

# Find by user ID
fields @timestamp, @message
| filter @message like /user_id:12345/
| sort @timestamp desc

# Aggregate errors by type
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message /ERROR: (?<error_type>.*?) -/
| stats count() as error_count by error_type
| sort error_count desc

# API response times
fields @timestamp, responseTime
| filter ispresent(responseTime)
| stats avg(responseTime), max(responseTime), min(responseTime) by bin(5m)

# Failed requests count
fields @timestamp
| filter status >= 500
| stats count() by bin(1m)
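
The same Insights queries can be run without the console via the AWS CLI; a sketch against the production log group:

# Kick off an Insights query over the last hour
QUERY_ID=$(aws logs start-query \
  --log-group-name /eks/prod/pods \
  --start-time $(date -u -d '1 hour ago' +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100' \
  --query 'queryId' --output text)

# Poll for results (the query runs asynchronously)
aws logs get-query-results --query-id "$QUERY_ID"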

Using Datadog #

Basic search:

  1. Navigate to Logs in Datadog
  2. Use search bar with filters:
# By service
service:bonsapi

# By status
status:error

# By time range (use picker)

# Combined filters
service:bonsapi status:error @http.status_code:500

# By user
@user.id:12345

# By request ID
@request_id:abc-123

Advanced search syntax:

# Exact match
@message:"database connection failed"

# Wildcard
@message:*connection*failed*

# Range
@http.response_time:[1000 TO 5000]

# Exists
@user.id:*

# Boolean
service:bonsapi AND status:error

# Exclude
service:bonsapi -status:info

Saved views:

Create saved views for common searches:

  • Error logs by service
  • High latency requests
  • Failed authentication attempts
  • Queue processing errors

Log Patterns and Formats #

BonsAPI (Rust) #

Format: [TIMESTAMP] [LEVEL] [MODULE] message
Example: [2025-10-21T10:15:30Z] [ERROR] [bonsapi::service::invoice] Failed to process invoice: Connection timeout

Key fields:

  • level - DEBUG, INFO, WARN, ERROR
  • module - Rust module path
  • request_id - Request correlation ID
  • user_id - User identifier

Webapp (Next.js) #

Format: JSON structured logs
Example:
{
  "timestamp": "2025-10-21T10:15:30.123Z",
  "level": "error",
  "service": "bonsai-webapp",
  "message": "Failed to fetch user data",
  "data": {
    "user_id": "123",
    "error": "Network timeout"
  }
}

Key fields:

  • level - debug, info, warn, error
  • service - bonsai-webapp
  • ddsource - nextjs
  • request_id - Request correlation ID
  • data - Additional context
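
Because these entries are line-delimited JSON, they can be filtered by field directly with jq; a sketch assuming the pods carry an app=bonsai-webapp label:

# Show only error-level entries and the fields that matter
kubectl logs -l app=bonsai-webapp --tail=500 \
  | jq -R 'fromjson? | select(.level == "error") | {timestamp, message, data}'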

Python Services (Invoice, Knowledge) #

Format: [TIMESTAMP] [LEVEL] [LOGGER] message
Example: [2025-10-21 10:15:30] [ERROR] [bonsai_invoice.jobs.document] Document processing failed: OCR timeout

Key fields:

  • level - DEBUG, INFO, WARNING, ERROR, CRITICAL
  • logger - Python logger name
  • document_id - Document identifier
  • job_id - Job identifier
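
The bracketed format makes it easy to count which loggers are producing errors; a sketch assuming the invoice pods carry the app=bonsai-invoice label used elsewhere in this runbook:

# Count ERROR lines per Python logger over the last hour
kubectl logs -l app=bonsai-invoice --since=1h \
  | grep '\[ERROR\]' \
  | awk -F'[][]' '{print $6}' \
  | sort | uniq -c | sort -rn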

Distributed Tracing with Trace IDs #

BonsAI implements end-to-end distributed tracing using trace IDs that flow from the frontend through all backend services to LLM calls.

How Trace IDs Work #

User Action (Frontend)
    ↓
Generate X-TOFU-TRACE-ID (UUID v4)
    ↓
API Request with X-TOFU-TRACE-ID header
    ↓
BonsAPI receives and logs trace_id
    ↓
RabbitMQ message includes trace_id
    ↓
Worker services (bonsai-invoice, etc.) log trace_id
    ↓
LLM calls tracked with same trace_id
    ↓
Datadog LLM Observability links everything

Trace ID Flow:

  1. Frontend (Webapp) - Generates trace ID and adds header

    // apps/webapp/src/shared/lib/api/axios-instance.ts
    import { v4 } from 'uuid'; // the uuid package provides the v4() generator
    const traceId = v4(); // UUID v4
    config.headers['X-TOFU-TRACE-ID'] = traceId;
    
  2. Backend (BonsAPI) - Extracts and logs trace ID

    // Rust services use bonsai-utils to get current trace
    let trace_id = current_trace_id();
    
  3. Workers (Python) - Includes trace ID in all operations

    # Python services use bonsai_utils.logging
    from bonsai_utils.logging import current_trace_id
    trace_id = current_trace_id()
    
  4. LLM Calls - Annotated with trace ID in Datadog

    # hinoki/utils/decorators/llm_obs/decorator.py
    LLMObs.annotate(input_data={"trace_id": trace_id})
    

Tracing End-to-End Requests #

Scenario: User uploads invoice → API processes → Worker extracts data → LLM analyzes

Step 1: Get the Trace ID

From frontend logs or API response headers:

# Example trace ID
X-TOFU-TRACE-ID: 550e8400-e29b-41d4-a716-446655440000
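
When reproducing an issue, a trace ID can also be injected up front so the whole flow can be followed with a known value. The host, path, and token below are placeholders, not real endpoints:

# Generate a trace ID first so it can be searched for immediately
TRACE_ID=$(uuidgen | tr 'A-Z' 'a-z')
echo "Using trace ID: $TRACE_ID"

# Issue the request with the trace ID header (replace host, path, and token)
curl -s "https://<api-host>/api/v1/<endpoint>" \
  -H "X-TOFU-TRACE-ID: ${TRACE_ID}" \
  -H "Authorization: Bearer <token>"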

Step 2: Search All Services with Trace ID

# Datadog (RECOMMENDED - searches all services)
@trace_id:550e8400-e29b-41d4-a716-446655440000

# kubectl (real-time logs)
kubectl logs -l app=bonsapi | grep "550e8400-e29b-41d4-a716-446655440000"
kubectl logs -l app=bonsai-invoice | grep "550e8400-e29b-41d4-a716-446655440000"

# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /550e8400-e29b-41d4-a716-446655440000/
| sort @timestamp asc

Step 3: View Complete Timeline

In Datadog:

  1. Search for trace ID: @trace_id:550e8400-e29b-41d4-a716-446655440000
  2. Sort by timestamp (ascending)
  3. See complete request flow:
    10:15:30.123 [bonsai-webapp] User clicked "Upload Invoice"
    10:15:30.234 [bonsapi] POST /api/v1/documents - trace_id: 550e8400...
    10:15:30.456 [bonsapi] Published to RabbitMQ queue: invoice.processing
    10:15:31.789 [bonsai-invoice] Received message from queue
    10:15:32.012 [bonsai-invoice] Starting document extraction
    10:15:35.678 [bonsai-invoice] Calling LLM for analysis
    10:15:38.901 [bonsai-invoice] LLM response received
    10:15:39.123 [bonsapi] Document processing completed
    

Step 4: Check LLM Observability

For LLM calls in the trace:

  1. Navigate to Datadog → LLM Observability
  2. Search by trace ID
  3. View LLM-specific metrics (see LLM Observability)

Common Log Searches #

Find all logs for a specific request using trace ID:

# Datadog (best option - searches everything)
@trace_id:550e8400-e29b-41d4-a716-446655440000

# kubectl (real-time)
kubectl logs -l app=bonsapi | grep "550e8400-e29b-41d4-a716-446655440000"

# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /550e8400-e29b-41d4-a716-446655440000/
| sort @timestamp asc

For older logs without trace IDs:

# kubectl
kubectl logs -l app=bonsapi | grep "request_id:abc-123"

# Datadog
@request_id:abc-123

# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /request_id:abc-123/
| sort @timestamp asc

Finding Errors #

All errors in last hour:

# kubectl (live)
kubectl logs -l app=bonsapi --since=1h | grep -i error

# Datadog (set the time picker to the past hour)
service:bonsapi status:error

# CloudWatch Insights (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR|FATAL/
| sort @timestamp desc

User Activity Logs #

Track specific user actions:

# kubectl
kubectl logs -l app=bonsapi | grep "user_id:12345"

# Datadog
@user.id:12345

# CloudWatch
fields @timestamp, @message
| filter @message like /user_id:12345/
| sort @timestamp asc

Document Processing Logs #

Find logs for specific document:

# kubectl
kubectl logs -l app=bonsai-invoice | grep "document_id:doc-123"

# Datadog
service:bonsai-invoice @document_id:doc-123

# CloudWatch
fields @timestamp, @message
| filter @logStream like /bonsai-invoice/
| filter @message like /document_id:doc-123/
| sort @timestamp asc

Database Query Logs #

Find slow database queries:

# Datadog
service:bonsapi @db.statement_duration:>1000

# CloudWatch
fields @timestamp, @message
| filter @message like /slow query/
| filter @message like /duration/
| sort @timestamp desc

Troubleshooting with Logs #

Investigating 500 Errors #

  1. Find the error logs

    kubectl logs -l app=bonsapi | grep "500\|ERROR" | tail -20
    
  2. Get the request ID from the error log

  3. Trace the full request

    kubectl logs -l app=bonsapi | grep "request_id:<id>"
    
  4. Check stack trace for root cause

  5. Verify database/external service logs (see the sketch below)
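
A quick way to cover step 5 is to sweep the neighboring services and recent cluster events around the same window; the worker label below is an assumption and should be swapped for whichever service the failing request touched:

# Errors and timeouts in a downstream worker over the last 30 minutes
kubectl logs -l app=bonsai-invoice --since=30m | grep -iE "error|timeout" | tail -20

# Recent cluster events (OOM kills, failed probes, restarts)
kubectl get events --sort-by=.lastTimestamp | tail -30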

Debugging Slow Performance #

  1. Find slow requests

    # Datadog
    @http.response_time:>2000 service:bonsapi
    
  2. Analyze patterns

    • Which endpoints are slow?
    • What time of day?
    • Specific users or all users?
  3. Check resource usage

    kubectl top pods
    
  4. Review database query performance

Authentication Issues #

  1. Find auth failures

    # Datadog
    service:bonsapi @message:*authentication*failed*
    
  2. Check for patterns

    • Specific users?
    • Specific auth method?
    • Timing correlation?
  3. Verify Clerk service status

  4. Check JWT validation logs (see the sketch below)
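
For step 4, a quick grep across the API pods usually narrows things down; a sketch that assumes auth failures surface as 401s or JWT errors in the log text:

# Auth-related log lines from the last hour
kubectl logs -l app=bonsapi --since=1h \
  | grep -iE "jwt|unauthorized|401|token (expired|invalid)" \
  | tail -50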

Log Management Best Practices #

What to Log #

DO log:

  • Request/response metadata (IDs, timestamps)
  • Errors and exceptions with context
  • Important business events (document processed, payment completed)
  • Performance metrics (query time, processing duration)
  • Security events (auth failures, permission denials)

DON’T log:

  • Passwords or secrets
  • Credit card numbers or PII
  • Large binary data
  • Excessive debug logs in production

Log Levels #

Use appropriate log levels:

  • DEBUG - Detailed development information (dev only)
  • INFO - General informational messages
  • WARN - Warning messages, potential issues
  • ERROR - Error messages, handled exceptions
  • FATAL/CRITICAL - Critical errors, service shutdown

Structured Logging #

Use structured logs (JSON) for easy parsing:

// Good
logger.info('User logged in', {
  user_id: '123',
  ip_address: '1.2.3.4',
  auth_method: 'oauth'
});

// Bad
logger.info(`User 123 logged in from 1.2.3.4 via oauth`);

Log Context #

Include correlation IDs for tracing:

// Rust example
tracing::info!(
    request_id = %request.id,
    user_id = %user.id,
    "Processing invoice"
);

Exporting Logs #

Download from CloudWatch #

# Using AWS CLI
aws logs filter-log-events \
  --log-group-name /eks/prod/pods \
  --start-time $(date -u -d '1 hour ago' +%s)000 \
  --output json > logs.json

Download from S3 Archive #

# List files for specific date
aws s3 ls s3://bonsai-logs-prod/application-log/2025/10/21/

# Download specific log file
aws s3 cp s3://bonsai-logs-prod/application-log/2025/10/21/log-file.gz ./

# Download entire day
aws s3 sync s3://bonsai-logs-prod/application-log/2025/10/21/ ./logs/

Export from Datadog #

  1. Navigate to Logs in Datadog
  2. Apply filters for desired logs
  3. Click Export → Download as CSV or JSON
  4. Select time range and fields

Log Retention & Compliance #

Retention Policies #

Location      Retention        Purpose
Kubernetes    Real-time only   Active debugging
CloudWatch    3 days           Recent troubleshooting
Datadog       15 days          Search and analysis
S3 Archive    365 days         Compliance, auditing

Compliance Considerations #

  • GDPR - User data can be requested and deleted
  • SOC 2 - Audit logs must be retained
  • Access control - Logs may contain sensitive data
  • Data privacy - PII should be masked or encrypted

See Also #