Log Management & Search #
This runbook covers how to find, analyze, and troubleshoot using logs across all BonsAI services.
When to Use This Runbook #
- Investigating errors or exceptions
- Tracing user requests across services (end-to-end with trace IDs)
- Debugging production issues
- Analyzing performance problems
- Audit and compliance reviews
- Following a request from frontend through backend to LLM
Log Architecture #
BonsAI logs are collected and stored in multiple locations:
Application Logs
↓
Container stdout/stderr
↓
Fluent Bit (log collector)
↓
┌─────────────┬─────────────┬─────────────┐
│ CloudWatch │ Datadog │ S3 Archive │
│ (3 days) │ (15 days) │ (365 days) │
└─────────────┴─────────────┴─────────────┘
Log Locations #
Kubernetes Pods (Real-time) #
# View logs for a specific pod
kubectl logs <pod-name>
# View logs from all pods with a label
kubectl logs -l app=bonsapi --tail=100
# Follow logs in real-time
kubectl logs -f <pod-name>
# View logs from previous container (if crashed)
kubectl logs <pod-name> --previous
# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name>
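For scripting rather than interactive use, the same logs can be fetched with the official kubernetes Python client; a minimal sketch (the default namespace and the app=bonsapi selector are assumptions):
from kubernetes import client, config

# Load kubeconfig the same way kubectl does (in-cluster code would use
# config.load_incluster_config() instead).
config.load_kube_config()
v1 = client.CoreV1Api()

# Find pods by label and fetch their recent logs (namespace is an assumption).
pods = v1.list_namespaced_pod(namespace="default", label_selector="app=bonsapi")
for pod in pods.items:
    logs = v1.read_namespaced_pod_log(
        name=pod.metadata.name, namespace="default", tail_lines=100
    )
    print(f"--- {pod.metadata.name} ---")
    print(logs)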
CloudWatch Logs (Short-term) #
Log Groups:
- /eks/prod/pods - Application logs from all pods
- /eks/dev/pods - Development environment logs
- /aws/eks/bonsai-app-eks-cluster-prod/cluster - EKS control plane logs
Retention: 3 days (production), then archived to S3
Access:
- AWS Console → CloudWatch → Log groups
- Select log group
- Filter by log stream (pod name)
Datadog (Medium-term) #
Retention: 15 days with full search capabilities
Access:
- Go to Datadog → Logs
- Use search filters
- View logs by service, time, severity
S3 Archive (Long-term) #
Retention: 365 days for compliance
Location:
s3://bonsai-logs-prod/application-log/
s3://bonsai-logs-dev/application-log/
Access:
# List recent log files
aws s3 ls s3://bonsai-logs-prod/application-log/ --recursive | tail -20
# Download specific log file
aws s3 cp s3://bonsai-logs-prod/application-log/2025/10/21/log-file.gz ./
# Uncompress and view
gunzip log-file.gz
less log-file
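After downloading a batch of archives, a short Python sketch like this can search them without unpacking each file by hand (the ./logs/ directory and the ERROR search term are assumptions):
import gzip
from pathlib import Path

# Scan every downloaded archive for a search term without unpacking to disk.
needle = "ERROR"  # assumption: whatever you are hunting for
for archive in Path("./logs").rglob("*.gz"):
    with gzip.open(archive, "rt", errors="replace") as f:
        for line in f:
            if needle in line:
                print(f"{archive}: {line.rstrip()}")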
Searching Logs #
Using kubectl #
Basic log viewing:
# Last 100 lines
kubectl logs <pod-name> --tail=100
# Since timestamp
kubectl logs <pod-name> --since-time=2025-10-21T10:00:00Z
# Since duration
kubectl logs <pod-name> --since=1h
# All pods with label
kubectl logs -l app=bonsapi --tail=50
Filtering with grep:
# Find errors
kubectl logs <pod-name> | grep -i error
# Find specific user activity
kubectl logs <pod-name> | grep "user_id:123"
# Find by request ID
kubectl logs <pod-name> | grep "request_id:abc-123"
# Multiple filters
kubectl logs <pod-name> | grep -i error | grep "invoice"
Using CloudWatch Insights #
CloudWatch Logs Insights provides a purpose-built query language for searching and aggregating logs.
Access:
- AWS Console → CloudWatch → Logs → Insights
- Select log group: /eks/prod/pods
- Enter query
- Select time range
- Run query
Example Queries:
# Find all errors
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Find errors for specific service
fields @timestamp, @message, @logStream
| filter @logStream like /bonsapi/
| filter @message like /ERROR|FATAL/
| sort @timestamp desc
# Find by user ID
fields @timestamp, @message
| filter @message like /user_id:12345/
| sort @timestamp desc
# Aggregate errors by type
fields @timestamp, @message
| filter @message like /ERROR/
| parse @message /ERROR: (?<error_type>.*?) -/
| stats count() as error_count by error_type
| sort error_count desc
# API response times
fields @timestamp, responseTime
| filter ispresent(responseTime)
| stats avg(responseTime), max(responseTime), min(responseTime) by bin(5m)
# Failed requests count
fields @timestamp
| filter status >= 500
| stats count() by bin(1m)
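To run Insights queries programmatically instead of through the console, a minimal boto3 sketch (the log group and query come from this page; the unbounded polling loop is simplified for illustration):
import time

import boto3

logs = boto3.client("logs")

# Kick off an Insights query over the last hour of /eks/prod/pods.
now = int(time.time())
query = logs.start_query(
    logGroupName="/eks/prod/pods",
    startTime=now - 3600,
    endTime=now,
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | limit 100",
)

# Poll until the query finishes (production code should bound this loop).
while True:
    result = logs.get_query_results(queryId=query["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

# Each result row is a list of {"field": ..., "value": ...} pairs.
for row in result["results"]:
    print({field["field"]: field["value"] for field in row})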
Using Datadog #
Basic search:
- Navigate to Logs in Datadog
- Use search bar with filters:
# By service
service:bonsapi
# By status
status:error
# By time range (use picker)
# Combined filters
service:bonsapi status:error @http.status_code:500
# By user
@user.id:12345
# By request ID
@request_id:abc-123
Advanced search syntax:
# Exact match
@message:"database connection failed"
# Wildcard
@message:*connection*failed*
# Range
@http.response_time:[1000 TO 5000]
# Exists
@user.id:*
# Boolean
service:bonsapi AND status:error
# Exclude
service:bonsapi -status:info
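The same searches can be scripted against Datadog's v2 Logs Search API; a minimal sketch with requests (the API/application keys are read from hypothetical environment variables, and the URL assumes the datadoghq.com site):
import os

import requests

# Datadog v2 log search endpoint (adjust the domain for EU or other sites).
url = "https://api.datadoghq.com/api/v2/logs/events/search"
headers = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}
body = {
    "filter": {
        "query": "service:bonsapi status:error",
        "from": "now-15m",
        "to": "now",
    },
    "page": {"limit": 25},
    "sort": "timestamp",
}

resp = requests.post(url, headers=headers, json=body, timeout=30)
resp.raise_for_status()
for event in resp.json().get("data", []):
    print(event["attributes"].get("message"))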
Saved views:
Create saved views for common searches:
- Error logs by service
- High latency requests
- Failed authentication attempts
- Queue processing errors
Log Patterns and Formats #
BonsAPI (Rust) #
Format: [TIMESTAMP] [LEVEL] [MODULE] message
Example: [2025-10-21T10:15:30Z] [ERROR] [bonsapi::service::invoice] Failed to process invoice: Connection timeout
Key fields:
- level - DEBUG, INFO, WARN, ERROR
- module - Rust module path
- request_id - Request correlation ID
- user_id - User identifier
Webapp (Next.js) #
Format: JSON structured logs
Example:
{
"timestamp": "2025-10-21T10:15:30.123Z",
"level": "error",
"service": "bonsai-webapp",
"message": "Failed to fetch user data",
"data": {
"user_id": "123",
"error": "Network timeout"
}
}
Key fields:
- level - debug, info, warn, error
- service - bonsai-webapp
- ddsource - nextjs
- request_id - Request correlation ID
- data - Additional context
Python Services (Invoice, Knowledge) #
Format: [TIMESTAMP] [LEVEL] [LOGGER] message
Example: [2025-10-21 10:15:30] [ERROR] [bonsai_invoice.jobs.document] Document processing failed: OCR timeout
Key fields:
- level - DEBUG, INFO, WARNING, ERROR, CRITICAL
- logger - Python logger name
- document_id - Document identifier
- job_id - Job identifier
Distributed Tracing with Trace IDs #
BonsAI implements end-to-end distributed tracing using trace IDs that flow from the frontend through all backend services to LLM calls.
How Trace IDs Work #
User Action (Frontend)
↓
Generate X-TOFU-TRACE-ID (UUID v4)
↓
API Request with X-TOFU-TRACE-ID header
↓
BonsAPI receives and logs trace_id
↓
RabbitMQ message includes trace_id
↓
Worker services (bonsai-invoice, etc.) log trace_id
↓
LLM calls tracked with same trace_id
↓
Datadog LLM Observability links everything
Trace ID Flow:
- Frontend (Webapp) - Generates the trace ID and adds the header:
// apps/webapp/src/shared/lib/api/axios-instance.ts
const traceId = v4(); // UUID v4
config.headers['X-TOFU-TRACE-ID'] = traceId;
- Backend (BonsAPI) - Extracts and logs the trace ID:
// Rust services use bonsai-utils to get the current trace
let trace_id = current_trace_id();
- Workers (Python) - Include the trace ID in all operations:
# Python services use bonsai_utils.logging
from bonsai_utils.logging import current_trace_id
trace_id = current_trace_id()
- LLM Calls - Annotated with the trace ID in Datadog:
# hinoki/utils/decorators/llm_obs/decorator.py
LLMObs.annotate(input_data={"trace_id": trace_id})
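To make the worker step concrete, here is a minimal sketch of a consumer that pulls the trace ID out of RabbitMQ message headers and attaches it to its log records. It is illustrative only: the trace_id header key, the queue name, and the direct pika usage are assumptions, not the actual bonsai-invoice implementation.
import logging

import pika  # assumption: a worker consuming directly with pika

logger = logging.getLogger("bonsai_invoice.worker")

def on_message(channel, method, properties, body):
    # Read the trace ID from the message headers (hypothetical header key).
    headers = properties.headers or {}
    trace_id = headers.get("trace_id", "unknown")
    # Attach the trace ID to the record so a JSON formatter / shipper can index it.
    logger.info("Received message from queue", extra={"trace_id": trace_id})
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="invoice.processing", on_message_callback=on_message)
channel.start_consuming()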
Tracing End-to-End Requests #
Scenario: User uploads invoice → API processes → Worker extracts data → LLM analyzes
Step 1: Get the Trace ID
From frontend logs or API response headers:
# Example trace ID
X-TOFU-TRACE-ID: 550e8400-e29b-41d4-a716-446655440000
Step 2: Search All Services with Trace ID
# Datadog (RECOMMENDED - searches all services)
@trace_id:550e8400-e29b-41d4-a716-446655440000
# kubectl (real-time logs)
kubectl logs -l app=bonsapi | grep "550e8400-e29b-41d4-a716-446655440000"
kubectl logs -l app=bonsai-invoice | grep "550e8400-e29b-41d4-a716-446655440000"
# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /550e8400-e29b-41d4-a716-446655440000/
| sort @timestamp asc
Step 3: View Complete Timeline
In Datadog:
- Search for the trace ID: @trace_id:550e8400-e29b-41d4-a716-446655440000
- Sort by timestamp (ascending)
- See complete request flow:
10:15:30.123 [bonsai-webapp]  User clicked "Upload Invoice"
10:15:30.234 [bonsapi]        POST /api/v1/documents - trace_id: 550e8400...
10:15:30.456 [bonsapi]        Published to RabbitMQ queue: invoice.processing
10:15:31.789 [bonsai-invoice] Received message from queue
10:15:32.012 [bonsai-invoice] Starting document extraction
10:15:35.678 [bonsai-invoice] Calling LLM for analysis
10:15:38.901 [bonsai-invoice] LLM response received
10:15:39.123 [bonsapi]        Document processing completed
Step 4: Check LLM Observability
For LLM calls in the trace:
- Navigate to Datadog → LLM Observability
- Search by trace ID
- View LLM-specific metrics (see LLM Observability)
Common Log Searches #
Tracing a Request by Trace ID (RECOMMENDED) #
Find all logs for a specific request using trace ID:
# Datadog (best option - searches everything)
@trace_id:550e8400-e29b-41d4-a716-446655440000
# kubectl (real-time)
kubectl logs -l app=bonsapi | grep "550e8400-e29b-41d4-a716-446655440000"
# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /550e8400-e29b-41d4-a716-446655440000/
| sort @timestamp asc
Legacy Request ID Search #
For older logs without trace IDs:
# kubectl
kubectl logs -l app=bonsapi | grep "request_id:abc-123"
# Datadog
@request_id:abc-123
# CloudWatch Insights
fields @timestamp, @message, @logStream
| filter @message like /request_id:abc-123/
| sort @timestamp asc
Finding Errors #
All errors in last hour:
# kubectl (live)
kubectl logs -l app=bonsapi --since=1h | grep -i error
# Datadog (set the time picker to the past 1 hour)
service:bonsapi status:error
# CloudWatch Insights (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR|FATAL/
| sort @timestamp desc
User Activity Logs #
Track specific user actions:
# kubectl
kubectl logs -l app=bonsapi | grep "user_id:12345"
# Datadog
@user.id:12345
# CloudWatch
fields @timestamp, @message
| filter @message like /user_id:12345/
| sort @timestamp asc
Document Processing Logs #
Find logs for specific document:
# kubectl
kubectl logs -l app=bonsai-invoice | grep "document_id:doc-123"
# Datadog
service:bonsai-invoice @document_id:doc-123
# CloudWatch
fields @timestamp, @message
| filter @logStream like /bonsai-invoice/
| filter @message like /document_id:doc-123/
| sort @timestamp asc
Database Query Logs #
Find slow database queries:
# Datadog
service:bonsapi @db.statement_duration:>1000
# CloudWatch
fields @timestamp, @message
| filter @message like /slow query/
| filter @message like /duration/
| sort @timestamp desc
Troubleshooting with Logs #
Investigating 500 Errors #
- Find the error logs:
kubectl logs -l app=bonsapi | grep "500\|ERROR" | tail -20
- Get the request ID from the error log
- Trace the full request:
kubectl logs -l app=bonsapi | grep "request_id:<id>"
- Check the stack trace for the root cause
- Verify database/external service logs
Debugging Slow Performance #
- Find slow requests:
# Datadog
@http.response_time:>2000 service:bonsapi
- Analyze patterns:
  - Which endpoints are slow?
  - What time of day?
  - Specific users or all users?
- Check resource usage:
kubectl top pods
- Review database query performance
Authentication Issues #
- Find auth failures:
# Datadog
service:bonsapi @message:*authentication*failed*
- Check for patterns:
  - Specific users?
  - Specific auth method?
  - Timing correlation?
- Verify Clerk service status
- Check JWT validation logs
Log Management Best Practices #
What to Log #
DO log:
- Request/response metadata (IDs, timestamps)
- Errors and exceptions with context
- Important business events (document processed, payment completed)
- Performance metrics (query time, processing duration)
- Security events (auth failures, permission denials)
DON’T log:
- Passwords or secrets
- Credit card numbers or PII
- Large binary data
- Excessive debug logs in production
Log Levels #
Use appropriate log levels:
- DEBUG - Detailed development information (dev only)
- INFO - General informational messages
- WARN - Warning messages, potential issues
- ERROR - Error messages, handled exceptions
- FATAL/CRITICAL - Critical errors, service shutdown
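One common way to enforce this is to drive the level from the environment so DEBUG never ships to production by default; a minimal Python sketch (the LOG_LEVEL variable name is an assumption):
import logging
import os

# Default to INFO; DEBUG must be opted into explicitly per environment.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))
logging.getLogger(__name__).debug("only visible when LOG_LEVEL=DEBUG")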
Structured Logging #
Use structured logs (JSON) for easy parsing:
// Good
logger.info('User logged in', {
user_id: '123',
ip_address: '1.2.3.4',
auth_method: 'oauth'
});
// Bad
logger.info(`User 123 logged in from 1.2.3.4 via oauth`);
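The same idea in Python, sketched with a small hand-rolled JSON formatter (illustrative only; the real services log through bonsai_utils.logging rather than this exact class):
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Carry over structured context passed via the `extra` kwarg.
        for key in ("user_id", "ip_address", "auth_method"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("bonsai")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User logged in", extra={"user_id": "123", "auth_method": "oauth"})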
Log Context #
Include correlation IDs for tracing:
// Rust example
tracing::info!(
request_id = %request.id,
user_id = %user.id,
"Processing invoice"
);
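In Python, the equivalent pattern is to carry the correlation ID in a contextvars.ContextVar and stamp it onto every record with a logging filter; a minimal sketch of the idea (not the bonsai_utils implementation):
import logging
from contextvars import ContextVar

# One context variable per correlation ID; async-safe, unlike a global.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp the current request ID onto every record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(format="[%(levelname)s] [request_id:%(request_id)s] %(message)s")
logger = logging.getLogger("bonsai")
logger.addFilter(RequestIdFilter())

request_id_var.set("abc-123")  # set once at the start of each request
logger.warning("Processing invoice")  # -> [WARNING] [request_id:abc-123] Processing invoice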
Exporting Logs #
Download from CloudWatch #
# Using AWS CLI
aws logs filter-log-events \
--log-group-name /eks/prod/pods \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--output json > logs.json
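The resulting logs.json is easy to post-process locally; for example, a sketch that counts ERROR lines per log stream (the events/logStreamName/message fields follow the documented filter-log-events output):
import json
from collections import Counter

# filter-log-events output: {"events": [{"logStreamName", "timestamp", "message"}, ...]}
with open("logs.json") as f:
    data = json.load(f)

errors = Counter(
    event["logStreamName"]
    for event in data.get("events", [])
    if "ERROR" in event.get("message", "")
)
for stream, count in errors.most_common(10):
    print(f"{count:6d}  {stream}")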
Download from S3 Archive #
# List files for specific date
aws s3 ls s3://bonsai-logs-prod/application-log/2025/10/21/
# Download specific log file
aws s3 cp s3://bonsai-logs-prod/application-log/2025/10/21/log-file.gz ./
# Download entire day
aws s3 sync s3://bonsai-logs-prod/application-log/2025/10/21/ ./logs/
Export from Datadog #
- Navigate to Logs in Datadog
- Apply filters for desired logs
- Click Export → Download as CSV or JSON
- Select time range and fields
Log Retention & Compliance #
Retention Policies #
| Location | Retention | Purpose |
|---|---|---|
| Kubernetes | Real-time only | Active debugging |
| CloudWatch | 3 days | Recent troubleshooting |
| Datadog | 15 days | Search and analysis |
| S3 Archive | 365 days | Compliance, auditing |
Compliance Considerations #
- GDPR - User data can be requested and deleted
- SOC 2 - Audit logs must be retained
- Access control - Logs may contain sensitive data
- Data privacy - PII should be masked or encrypted
See Also #
- Monitoring & Alerting - Using logs with metrics
- Kubernetes Debugging - Accessing pod logs
- Incident Response - Using logs during incidents
- Database Access - Database query logs
- LLM Observability - Tracing LLM calls with trace IDs