LLM Observability #
This runbook covers monitoring and debugging Large Language Model (LLM) operations in BonsAI using Datadog’s LLM Observability platform.
When to Use This Runbook #
- Investigating LLM performance issues (slow responses, high costs)
- Debugging document extraction errors
- Analyzing prompt effectiveness
- Monitoring token usage and costs
- Tracing LLM calls end-to-end with trace IDs
- Optimizing prompts and model selection
Overview #
BonsAI uses multiple LLM providers for document processing:
- OpenAI (GPT-4, GPT-3.5) - Primary provider for invoice extraction
- Anthropic Claude - Alternative for complex documents
- Mistral AI - Cost-effective option for specific tasks
All LLM calls are instrumented with Datadog LLM Observability for comprehensive monitoring.
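For reference, instrumentation is typically switched on once at process startup. The snippet below is a minimal sketch using the public ddtrace LLMObs API; where exactly BonsAI does this, and whether it runs agentless, is an assumption rather than something this runbook confirms (DD_API_KEY and DD_SITE come from the environment, e.g. via Doppler).

```python
# Minimal sketch: enabling Datadog LLM Observability at worker startup (illustrative).
# Assumes DD_API_KEY / DD_SITE are already set in the environment.
from ddtrace.llmobs import LLMObs

LLMObs.enable(
    ml_app="bonsai-invoice",  # application name shown in the LLM Observability UI (assumed)
    agentless_enabled=True,   # send directly to Datadog if no local agent is running
)
```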
Accessing LLM Observability #
1. Navigate to Datadog
   - URL: https://datadoghq.com (US3 region)
   - Use company SSO to log in
2. Open LLM Observability
   - Left sidebar → LLM Observability
   - Or directly: https://app.datadoghq.com/llm
3. Key Views
   - Traces - Individual LLM call traces
   - Applications - Service-level metrics
   - Workflows - Multi-step LLM operations
   - Costs - Token usage and cost analysis
How LLM Observability Works #
Instrumentation Architecture #
```
User Request (Frontend)
        ↓
X-TOFU-TRACE-ID generated
        ↓
BonsAPI receives request
        ↓
RabbitMQ message with trace_id
        ↓
bonsai-invoice worker picks up job
        ↓
Python decorator @llm_obs_workflow
        ↓
LLM call with trace_id annotation
        ↓
Datadog LLM Observability records:
  - Input/output
  - Latency
  - Token usage
  - Cost
  - Model/provider
  - Trace ID
```
Code Integration #
Python Decorators (Hinoki Library):
```python
# Location: libs/python/bonsai-hinoki/hinoki/src/hinoki/utils/decorators/llm_obs/decorator.py

# Workflow-level tracking
@llm_obs_workflow_aync(workflow_name="invoice_extraction")
async def extract_invoice_data(document):
    trace_id = current_trace_id()
    # LLM operations tracked automatically
    ...

# LLM call tracking
@llm_obs_llm(model_provider="openai")
async def call_openai(messages, model):
    # Tracks input, output, tokens, cost
    ...

# Task tracking
@llm_obs_task(task_name="vendor_matching")
async def match_vendor(invoice_data):
    # Tracks sub-tasks within workflow
    ...
```
Trace ID Flow:
```python
# bonsai_utils.logging provides trace context
from bonsai_utils.logging import current_trace_id

trace_id = current_trace_id()  # Gets X-TOFU-TRACE-ID from request

# Annotate LLM calls with trace ID
LLMObs.annotate(
    input_data={"trace_id": trace_id, "args": args},
    tags={"workflow_name": "invoice_extraction"}
)
```
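The hop where trace context is most easily lost is the queue hand-off between BonsAPI and the worker. The sketch below shows the publisher side of that hand-off under stated assumptions: the queue name, payload shape, and fallback ID generation are illustrative rather than the actual BonsAPI implementation; only the trace_id field mirrors the patterns shown elsewhere in this runbook.

```python
# Illustrative sketch of the trace-ID hand-off described above (not the exact BonsAPI code).
# Queue name and payload shape are assumptions; the trace_id field mirrors this runbook's examples.
import json
import uuid

import pika


def publish_extraction_job(channel: pika.channel.Channel, document_id: str, trace_id: str | None) -> None:
    """Publish an extraction job, always carrying the X-TOFU-TRACE-ID value."""
    payload = {
        "document_id": document_id,
        "trace_id": trace_id or str(uuid.uuid4()),  # fall back if the header was missing
        "job_type": "invoice_extraction",
    }
    channel.basic_publish(
        exchange="",
        routing_key="invoice-extraction",  # assumed queue name
        body=json.dumps(payload),
    )
```

On the consumer side, the worker reads trace_id from the payload and restores it with set_trace_context before any LLM call (see Issue 2 under Common Issues & Solutions).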
Common Tasks #
Tracing an LLM Call by Trace ID #
Scenario: User reports incorrect invoice extraction
Step 1: Get the trace ID
From frontend or API logs:
X-TOFU-TRACE-ID: 550e8400-e29b-41d4-a716-446655440000
Step 2: Search in LLM Observability
- Navigate to Datadog → LLM Observability → Traces
- Search by trace ID: @trace_id:550e8400-e29b-41d4-a716-446655440000
- Or use the metadata filter: @metadata.trace_id:550e8400-e29b-41d4-a716-446655440000
Step 3: Analyze the LLM trace
View detailed information:
- Input prompt - What was sent to the LLM
- Output response - What the LLM returned
- Model used - (e.g., gpt-4-turbo)
- Token usage - Input/output tokens
- Latency - Time to first token, total time
- Cost - Estimated cost of the call
- Workflow context - Parent workflow and tasks
Step 4: Compare with logs
Cross-reference with application logs:
In Datadog Logs, search @trace_id:550e8400-e29b-41d4-a716-446655440000 to see the complete flow:
1. User uploaded document
2. Document sent to processing queue
3. LLM called for extraction
4. Results stored in database
Investigating Slow LLM Responses #
Find slow LLM calls:
- LLM Observability → Traces
- Filter by latency: @duration:>30s
- Group by:
  - Model (is one model slower?)
  - Workflow (which workflow is slow?)
  - Provider (provider-specific issues?)
Common causes:
- Large prompts - Too much context sent to LLM
- Model selection - Wrong model for the task (GPT-4 vs GPT-3.5)
- Provider throttling - Rate limits hit
- Network issues - Connectivity to LLM provider
Resolution:
```python
# Optimize prompt size
# Before: sending the entire 50-page document
prompt = f"Extract data from: {full_document_text}"

# After: send only the relevant pages
prompt = f"Extract data from: {relevant_pages}"

# Use a faster model for simple tasks
# Before: gpt-4-turbo for all tasks
model = "gpt-4-turbo"

# After: use gpt-3.5-turbo for simple extraction
model = "gpt-3.5-turbo" if is_simple else "gpt-4-turbo"
```
Analyzing Token Usage and Costs #
View cost breakdown:
- LLM Observability → Applications
- Select application: bonsai-invoice
- View metrics:
  - Total tokens (input + output)
  - Cost per day/week/month
  - Cost by model
  - Cost by workflow
Identify expensive operations:
Filter: @metadata.workflow_name:invoice_extraction
Group by: @model_name
Sort by: Total cost (descending)
Results:
- gpt-4-turbo: $450/day (80% of cost)
- gpt-3.5-turbo: $95/day (20% of cost)
Cost optimization strategies:
1. Use cheaper models when possible
   - Simple extraction → gpt-3.5-turbo
   - Complex analysis → gpt-4-turbo
2. Reduce prompt size
   - Only send relevant document sections
   - Remove redundant context
3. Implement caching
   - Cache vendor matching results
   - Reuse extracted data when possible
4. Batch operations
   - Extract multiple invoices in one call
   - Reduce per-call overhead
Debugging Extraction Errors #
Scenario: LLM extracts wrong data
Step 1: Find the trace
# Search by document ID
@metadata.document_id:doc-12345
# Or by workflow
@metadata.workflow_name:invoice_extraction
Step 2: Review input prompt
- Click on LLM trace
- View Input tab
- Check:
- Is the prompt clear?
- Is document text readable?
- Are instructions specific?
Step 3: Review output
- View Output tab
- Compare with expected result
- Identify pattern:
- Missing fields?
- Incorrect format?
- Hallucinated data?
Step 4: Test prompt improvements
```python
# Bad prompt (vague)
"Extract invoice data"

# Good prompt (specific)
"""Extract the following fields from the invoice:
- Invoice number (format: INV-XXXXX)
- Invoice date (format: YYYY-MM-DD)
- Total amount (numeric value only)
- Vendor name (as shown on invoice)
Return as JSON with exact field names."""
```
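Once the prompt requests JSON with exact field names, it is worth validating the response before storing it, so malformed or incomplete output fails loudly instead of silently producing wrong data. A minimal sketch, assuming snake_case field names (illustrative; match whatever names the prompt specifies):

```python
import json

# Illustrative field names; keep these in sync with the prompt's "exact field names".
REQUIRED_FIELDS = {"invoice_number", "invoice_date", "total_amount", "vendor_name"}


def parse_extraction(raw_response: str) -> dict:
    """Parse and sanity-check the LLM's JSON output before it is stored."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM returned non-JSON output: {exc}") from exc

    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"Extraction missing fields: {sorted(missing)}")
    return data
```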
Monitoring Workflow Performance #
View workflow metrics:
- LLM Observability → Applications → bonsai-invoice
- View workflows:
  - invoice_extraction - Main extraction workflow
  - vendor_matching - Vendor identification
  - line_item_extraction - Line-by-line extraction
Key metrics per workflow:
- Success rate - % of successful completions
- Average latency - Time to complete
- Token usage - Tokens consumed
- Cost - Total cost
- Error rate - % of failures
Set up alerts:
```
Alert: invoice_extraction success rate < 95%
Threshold: 95%
Evaluation window: 15 minutes
Notification: PagerDuty (SEV2)
```
LLM Observability Best Practices #
Comprehensive Instrumentation #
DO instrument:
- All LLM calls (OpenAI, Claude, Mistral)
- Multi-step workflows
- Prompt variations (A/B testing)
- Error handling and retries
Example:
```python
@llm_obs_workflow_aync(workflow_name="invoice_extraction")
async def extract_invoice(document_id: str):
    trace_id = current_trace_id()

    # Step 1: Document preprocessing
    with LLMObs.task(name="preprocess_document"):
        processed_doc = await preprocess(document_id)

    # Step 2: LLM extraction
    with LLMObs.llm(model_name="gpt-4-turbo", name="extract_fields"):
        result = await call_llm(processed_doc)

    # Step 3: Post-processing
    with LLMObs.task(name="validate_output"):
        validated = await validate(result)

    return validated
```
Meaningful Annotations #
Add context to traces:
```python
LLMObs.annotate(
    input_data={
        "trace_id": trace_id,
        "document_id": document_id,
        "document_type": "invoice",
        "page_count": page_count
    },
    output_data={
        "extraction_confidence": 0.95,
        "fields_extracted": ["invoice_number", "date", "total"]
    },
    tags={
        "customer_id": customer_id,
        "workflow_version": "v2.1"
    }
)
```
Cost Monitoring #
Track costs by:
- Customer/organization
- Document type (invoice, receipt, contract)
- Model provider
- Workflow type
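Those breakdowns only work if every call carries the matching tags at annotation time; the cost views can then be grouped by the same keys. A minimal sketch, reusing the annotation pattern above (tag keys are illustrative):

```python
from ddtrace.llmobs import LLMObs


def tag_llm_call_for_cost_tracking(customer_id: str, document_type: str, workflow: str) -> None:
    """Attach cost-attribution tags to the active LLM span."""
    LLMObs.annotate(
        tags={
            "customer_id": customer_id,      # cost per customer/organization
            "document_type": document_type,  # invoice, receipt, contract, ...
            "workflow_name": workflow,       # e.g. invoice_extraction
        }
    )
```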
Set cost alerts:
```
Alert: Daily LLM cost > $500
Threshold: $500
Notification: Slack #eng-alerts
```
Error Handling #
Log errors with context:
```python
try:
    result = await call_llm(prompt)
except Exception as e:
    LLMObs.annotate(
        output_data={
            "error": str(e),
            "error_type": type(e).__name__,
            "retry_count": retry_count
        }
    )
    raise
```
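When retries are layered on top of this, record the attempt number on each failure so repeated calls show up in the trace as one retried operation rather than unrelated errors. A minimal sketch with manual exponential backoff, assuming an existing call_llm coroutine as in the snippet above:

```python
import asyncio

from ddtrace.llmobs import LLMObs


async def call_llm_with_retries(prompt: str, max_attempts: int = 3) -> str:
    """Retry transient LLM failures with exponential backoff, annotating each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call_llm(prompt)  # assumed existing helper
        except Exception as exc:
            LLMObs.annotate(
                output_data={
                    "error": str(exc),
                    "error_type": type(exc).__name__,
                    "retry_count": attempt,
                }
            )
            if attempt == max_attempts:
                raise
            await asyncio.sleep(2 ** attempt)  # 2s, 4s, ... between attempts
```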
Common Issues & Solutions #
Issue 1: Missing LLM Traces #
Problem: LLM calls not appearing in Datadog
Causes:
- Decorator not applied
- Datadog API key missing
- Network connectivity issues
Solutions:
# 1. Verify decorator is applied
grep -r "@llm_obs" apps/bonsai-invoice/
# 2. Check Datadog API key
doppler secrets get DD_API_KEY --project bonsai --config prod
# 3. Test connectivity
kubectl exec -it <bonsai-invoice-pod> -- curl https://api.datadoghq.com
# 4. Check pod logs for Datadog errors
kubectl logs -l app=bonsai-invoice | grep -i datadog
Issue 2: Incomplete Trace Context #
Problem: Trace ID not linking frontend to LLM calls
Causes:
- Trace ID not passed through RabbitMQ
- Trace context lost in async operations
Solutions:
```python
# Ensure trace_id is in the message payload
message = {
    "document_id": document_id,
    "trace_id": trace_id,  # ← Must include
    "job_type": "invoice_extraction"
}

# Set trace context in the worker
from bonsai_utils.logging import set_trace_context

set_trace_context(message["trace_id"])
```
Issue 3: High LLM Costs #
Problem: Unexpected spike in LLM spending
Investigation:
1. Identify cost source
   - LLM Observability → Applications → Cost breakdown
   - Group by: workflow, model, customer
2. Find expensive calls
   - Filter: @cost:>1.00
   - Sort by: Cost (descending)
3. Analyze patterns
   - Large documents being processed repeatedly?
   - Wrong model for simple tasks?
   - Retry loops causing duplicate calls?
Solutions:
```python
# 1. Implement cost limits
if estimated_cost > MAX_COST_PER_DOCUMENT:
    use_cheaper_model = True

# 2. Add caching
if cached_result := get_cached_extraction(document_hash):
    return cached_result

# 3. Optimize prompts
prompt = optimize_prompt_size(prompt, max_tokens=2000)
```
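For the cost-limit check above, the input side of estimated_cost can be computed before the call by counting tokens locally with tiktoken. A sketch using placeholder per-token prices (keep them in sync with current provider rates, e.g. the comparison under Reduce Costs below); output tokens can only be bounded, not known in advance:

```python
import tiktoken

# Placeholder pricing in USD per 1M input tokens; update to current provider rates.
INPUT_PRICE_PER_M = {"gpt-4-turbo": 30.0, "gpt-3.5-turbo": 0.50}


def estimate_input_cost(prompt: str, model: str) -> float:
    """Rough pre-call cost estimate based on the prompt's token count."""
    encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer family for these models
    token_count = len(encoding.encode(prompt))
    return token_count / 1_000_000 * INPUT_PRICE_PER_M[model]
```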
Issue 4: Low Extraction Accuracy #
Problem: LLM extracting incorrect data
Investigation:
1. Review failed extractions
   - Filter: @metadata.extraction_confidence:<0.8
2. Compare prompts
   - View input prompts for low-confidence extractions
   - Identify common patterns in failures
3. Test prompt variations
   - Create A/B test with improved prompts
   - Track accuracy by prompt version (see the tagging sketch after the prompt examples below)
Solutions:
```python
# Improve prompt specificity
prompt_v1 = "Extract invoice data"  # Vague

prompt_v2 = """Extract these exact fields:
1. Invoice Number: Find text labeled "Invoice #" or "Invoice Number"
2. Date: Find text labeled "Invoice Date" or "Date"
3. Total: Find text labeled "Total" or "Amount Due"
Format as JSON."""  # Specific

# Add examples (few-shot learning)
prompt_v3 = f"""{prompt_v2}

Example input: "Invoice #12345, Date: 2025-01-15, Total: $500.00"
Example output: {{"invoice_number": "12345", "date": "2025-01-15", "total": 500.00}}

Now extract from: {document_text}"""
```
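To track accuracy by prompt version (step 3 of the investigation above), tag each call with the variant that handled it so confidence scores and error rates can be grouped by prompt_version in LLM Observability. A minimal sketch; the variant labels, template placeholder, and call_llm helper are illustrative:

```python
import random

from ddtrace.llmobs import LLMObs


async def extract_with_ab_test(document_text: str, prompt_variants: dict[str, str]) -> str:
    """Pick a prompt variant at random and record which one handled the call.

    prompt_variants maps a version label (e.g. "v2", "v3") to a prompt template
    containing a {document_text} placeholder.
    """
    version = random.choice(list(prompt_variants))
    LLMObs.annotate(tags={"prompt_version": version})  # group accuracy/confidence by this tag
    prompt = prompt_variants[version].format(document_text=document_text)
    return await call_llm(prompt)  # assumed existing helper
```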
Performance Optimization #
Reduce Latency #
1. Use faster models
   - gpt-3.5-turbo instead of gpt-4 for simple tasks
   - Streaming responses for real-time UX
2. Parallel LLM calls

   ```python
   # Sequential (slow)
   vendor = await extract_vendor(doc)
   items = await extract_items(doc)

   # Parallel (fast)
   vendor, items = await asyncio.gather(
       extract_vendor(doc),
       extract_items(doc)
   )
   ```

3. Optimize prompt size
   - Only send relevant document sections
   - Remove redundant instructions
Reduce Costs #
1. Model selection

   ```python
   # Cost comparison (per 1M tokens)
   # gpt-4-turbo: $30 input / $60 output
   # gpt-3.5-turbo: $0.50 input / $1.50 output

   # Use the cheaper model for ~95% of tasks
   model = "gpt-3.5-turbo" if is_simple_invoice else "gpt-4-turbo"
   ```

2. Caching

   ```python
   # Cache extracted data
   cache_key = f"extraction:{document_hash}"
   if cached := redis.get(cache_key):
       return cached

   result = await extract_with_llm(document)
   redis.setex(cache_key, 86400, result)  # 24h cache
   ```

3. Batch processing

   ```python
   # Process multiple documents in one LLM call
   prompt = f"""Extract data from these {len(invoices)} invoices:
   Invoice 1: {invoice_1_text}
   Invoice 2: {invoice_2_text}
   ...
   Return as array of JSON objects."""
   ```
Dashboards and Alerts #
Key Dashboards #
Create custom dashboards in Datadog:
1. LLM Operations Dashboard
   - Total LLM calls (by service)
   - Average latency (by model)
   - Success rate (by workflow)
   - Cost trend (daily/weekly)
2. Cost Monitoring Dashboard
   - Daily spend by model
   - Cost per customer
   - Token usage trends
   - Cost anomalies
3. Quality Dashboard
   - Extraction accuracy
   - Confidence scores
   - Error rates by workflow
   - Retry rates
Recommended Alerts #
```
# Alert 1: High LLM latency
Alert: "LLM calls taking >30 seconds"
Condition: avg:llm.duration{workflow:invoice_extraction} > 30
Window: last_15m
Severity: WARNING
Notify: #eng-alerts

# Alert 2: Cost spike
Alert: "Daily LLM cost exceeds budget"
Condition: sum:llm.cost{service:bonsai-invoice} > 500
Window: last_24h
Severity: WARNING
Notify: #eng-alerts, finance@gotofu.com

# Alert 3: Low accuracy
Alert: "Extraction confidence below threshold"
Condition: avg:llm.confidence{workflow:invoice_extraction} < 0.85
Window: last_1h
Severity: WARNING
Notify: #eng-alerts

# Alert 4: High error rate
Alert: "LLM error rate above 5%"
Condition: sum:llm.errors{service:bonsai-invoice}.as_rate() > 0.05
Window: last_15m
Severity: CRITICAL
Notify: PagerDuty
```
See Also #
- Log Management - End-to-end tracing with trace IDs
- Monitoring & Alerting - General monitoring practices
- RabbitMQ Management - Queue-based LLM job processing
- Service Health - Health checks for LLM-dependent services
- Incident Response - Handling LLM-related incidents