LLM Observability #

This runbook covers monitoring and debugging Large Language Model (LLM) operations in BonsAI using Datadog’s LLM Observability platform.

When to Use This Runbook #

  • Investigating LLM performance issues (slow responses, high costs)
  • Debugging document extraction errors
  • Analyzing prompt effectiveness
  • Monitoring token usage and costs
  • Tracing LLM calls end-to-end with trace IDs
  • Optimizing prompts and model selection

Overview #

BonsAI uses multiple LLM providers for document processing:

  • OpenAI (GPT-4, GPT-3.5) - Primary provider for invoice extraction
  • Anthropic Claude - Alternative for complex documents
  • Mistral AI - Cost-effective option for specific tasks

All LLM calls are instrumented with Datadog LLM Observability for comprehensive monitoring.
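
A minimal sketch of what this instrumentation setup can look like at worker startup, using ddtrace's LLMObs.enable. The ml_app name, region, and agentless mode shown here are illustrative assumptions, not the actual BonsAI configuration:

# Sketch only: enable Datadog LLM Observability when the worker starts.
# The ml_app name, region, and agentless mode are illustrative assumptions.
import os

from ddtrace.llmobs import LLMObs

LLMObs.enable(
    ml_app="bonsai-invoice",           # assumed application name
    api_key=os.environ["DD_API_KEY"],  # required when running agentless
    site="us3.datadoghq.com",          # match the Datadog region in use
    agentless_enabled=True,            # submit directly to Datadog, no local agent
)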

Accessing LLM Observability #

  1. Navigate to Datadog

    • URL: https://us3.datadoghq.com (US3 region)
    • Use company SSO to log in
  2. Open LLM Observability

    • Left sidebar → LLM Observability
    • Or directly: https://us3.datadoghq.com/llm
  3. Key Views

    • Traces - Individual LLM call traces
    • Applications - Service-level metrics
    • Workflows - Multi-step LLM operations
    • Costs - Token usage and cost analysis

How LLM Observability Works #

Instrumentation Architecture #

User Request (Frontend)
    ↓
X-TOFU-TRACE-ID generated
    ↓
BonsAPI receives request
    ↓
RabbitMQ message with trace_id
    ↓
bonsai-invoice worker picks up job
    ↓
Python decorator @llm_obs_workflow
    ↓
LLM call with trace_id annotation
    ↓
Datadog LLM Observability records:
  - Input/output
  - Latency
  - Token usage
  - Cost
  - Model/provider
  - Trace ID

Code Integration #

Python Decorators (Hinoki Library):

# Location: libs/python/bonsai-hinoki/hinoki/src/hinoki/utils/decorators/llm_obs/decorator.py

# Workflow-level tracking
@llm_obs_workflow_aync(workflow_name="invoice_extraction")
async def extract_invoice_data(document):
    trace_id = current_trace_id()
    # LLM operations tracked automatically
    ...

# LLM call tracking
@llm_obs_llm(model_provider="openai")
async def call_openai(messages, model):
    # Tracks input, output, tokens, cost
    ...

# Task tracking
@llm_obs_task(task_name="vendor_matching")
async def match_vendor(invoice_data):
    # Tracks sub-tasks within workflow
    ...

Trace ID Flow:

# bonsai_utils.logging provides trace context
from bonsai_utils.logging import current_trace_id

trace_id = current_trace_id()  # Gets X-TOFU-TRACE-ID from request

# Annotate LLM calls with trace ID
LLMObs.annotate(
    input_data={"trace_id": trace_id, "args": args},
    tags={"workflow_name": "invoice_extraction"}
)
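
For completeness, a sketch of the producer side of this flow: reading X-TOFU-TRACE-ID from the incoming request and carrying it in the queue message. FastAPI, aio-pika, the route, and the queue name are assumptions for illustration; BonsAPI's actual framework and publishing helpers may differ.

# Sketch only: propagate X-TOFU-TRACE-ID from the HTTP request into the
# RabbitMQ message. FastAPI and aio-pika are assumed for illustration.
import json
import uuid

import aio_pika
from fastapi import FastAPI, Header

app = FastAPI()

@app.post("/documents/{document_id}/extract")
async def enqueue_extraction(document_id: str, x_tofu_trace_id: str | None = Header(default=None)):
    trace_id = x_tofu_trace_id or str(uuid.uuid4())  # fall back if the header is missing
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    async with connection:
        channel = await connection.channel()
        await channel.default_exchange.publish(
            aio_pika.Message(body=json.dumps({
                "document_id": document_id,
                "trace_id": trace_id,  # consumed by the worker (see Issue 2 below)
                "job_type": "invoice_extraction",
            }).encode()),
            routing_key="invoice_extraction",  # assumed queue name
        )
    return {"trace_id": trace_id}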

Common Tasks #

Tracing an LLM Call by Trace ID #

Scenario: User reports incorrect invoice extraction

Step 1: Get the trace ID

From frontend or API logs:

X-TOFU-TRACE-ID: 550e8400-e29b-41d4-a716-446655440000

Step 2: Search in LLM Observability

  1. Navigate to Datadog → LLM Observability → Traces
  2. Search by trace ID:
    @trace_id:550e8400-e29b-41d4-a716-446655440000
    
  3. Or use metadata filter:
    @metadata.trace_id:550e8400-e29b-41d4-a716-446655440000
    

Step 3: Analyze the LLM trace

View detailed information:

  • Input prompt - What was sent to the LLM
  • Output response - What the LLM returned
  • Model used - (e.g., gpt-4-turbo)
  • Token usage - Input/output tokens
  • Latency - Time to first token, total time
  • Cost - Estimated cost of the call
  • Workflow context - Parent workflow and tasks

Step 4: Compare with logs

Cross-reference with application logs:

# In Datadog Logs
@trace_id:550e8400-e29b-41d4-a716-446655440000

# Shows complete flow:
1. User uploaded document
2. Document sent to processing queue
3. LLM called for extraction
4. Results stored in database
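
This cross-referencing only works when application logs carry the same trace ID. In BonsAI that is handled by bonsai_utils.logging; as a generic illustration, here is a minimal sketch with the standard logging module and a filter. The trace_id field name is the assumption that makes the @trace_id: log query match.

# Sketch only: attach the trace ID to every log record so Datadog log
# queries like @trace_id:<uuid> line up with LLM Observability traces.
import logging

class TraceIdFilter(logging.Filter):
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id  # exposed to the formatter below
        return True

logger = logging.getLogger("bonsai-invoice")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('{"message": "%(message)s", "trace_id": "%(trace_id)s"}'))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("550e8400-e29b-41d4-a716-446655440000"))

logger.warning("LLM extraction returned low confidence")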

Investigating Slow LLM Responses #

Find slow LLM calls:

  1. LLM Observability → Traces
  2. Filter by latency:
    @duration:>30s
    
  3. Group by:
    • Model (is one model slower?)
    • Workflow (which workflow is slow?)
    • Provider (provider-specific issues?)

Common causes:

  • Large prompts - Too much context sent to LLM
  • Model selection - Wrong model for the task (GPT-4 vs GPT-3.5)
  • Provider throttling - Rate limits hit (see the retry sketch after Resolution)
  • Network issues - Connectivity to LLM provider

Resolution:

# Optimize prompt size
# Before: Sending entire 50-page document
prompt = f"Extract data from: {full_document_text}"

# After: Send only relevant pages
prompt = f"Extract data from: {relevant_pages}"

# Use faster model for simple tasks
# Before: gpt-4-turbo for all tasks
model = "gpt-4-turbo"

# After: Use gpt-3.5 for simple extraction
model = "gpt-3.5-turbo" if is_simple else "gpt-4-turbo"
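
For the provider-throttling cause, retrying with exponential backoff usually clears transient 429s. A minimal sketch using the tenacity library with the OpenAI SDK's RateLimitError; the library choice, retry budget, and model are illustrative assumptions:

# Sketch only: back off and retry when the provider rate-limits us.
# tenacity and the retry budget below are assumptions for illustration.
from openai import AsyncOpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncOpenAI()

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=30),  # 2s, 4s, 8s, ... capped at 30s
    stop=stop_after_attempt(5),
)
async def call_with_backoff(messages: list[dict], model: str = "gpt-3.5-turbo"):
    response = await client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content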

Analyzing Token Usage and Costs #

View cost breakdown:

  1. LLM Observability → Applications
  2. Select application: bonsai-invoice
  3. View metrics:
    • Total tokens (input + output)
    • Cost per day/week/month
    • Cost by model
    • Cost by workflow

Identify expensive operations:

Filter: @metadata.workflow_name:invoice_extraction
Group by: @model_name
Sort by: Total cost (descending)

Results:
- gpt-4-turbo: $450/day (80% of cost)
- gpt-3.5-turbo: $95/day (20% of cost)

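For reference, a short sketch of how a per-call cost estimate can be derived from token counts and per-million-token prices. The PRICES table is approximate and should be checked against current provider pricing:

# Sketch only: derive a per-call cost estimate from token counts.
# Prices are USD per 1M tokens (input, output) and are approximate.
PRICES = {"gpt-4-turbo": (10.00, 30.00), "gpt-3.5-turbo": (0.50, 1.50)}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# e.g. a 6,000-token prompt with a 1,200-token reply on gpt-4-turbo:
# (6000 * 10 + 1200 * 30) / 1e6 = $0.096
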
Cost optimization strategies:

  1. Use cheaper models when possible

    # Simple extraction → gpt-3.5-turbo
    # Complex analysis → gpt-4-turbo
    
  2. Reduce prompt size

    # Only send relevant document sections
    # Remove redundant context
    
  3. Implement caching

    # Cache vendor matching results
    # Reuse extracted data when possible
    
  4. Batch operations

    # Extract multiple invoices in one call
    # Reduce per-call overhead
    

Debugging Extraction Errors #

Scenario: LLM extracts wrong data

Step 1: Find the trace

# Search by document ID
@metadata.document_id:doc-12345

# Or by workflow
@metadata.workflow_name:invoice_extraction

Step 2: Review input prompt

  1. Click on LLM trace
  2. View Input tab
  3. Check:
    • Is the prompt clear?
    • Is document text readable?
    • Are instructions specific?

Step 3: Review output

  1. View Output tab
  2. Compare with expected result
  3. Identify pattern:
    • Missing fields?
    • Incorrect format?
    • Hallucinated data?

Step 4: Test prompt improvements

# Bad prompt (vague)
"Extract invoice data"

# Good prompt (specific)
"""Extract the following fields from the invoice:
- Invoice number (format: INV-XXXXX)
- Invoice date (format: YYYY-MM-DD)
- Total amount (numeric value only)
- Vendor name (as shown on invoice)

Return as JSON with exact field names."""
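
When the prompt asks for JSON, it can also help to enforce it at the API level and validate the result. A minimal sketch using the OpenAI chat completions JSON response format plus a Pydantic v2 model; the field names mirror the prompt above and are assumptions, not the exact schema BonsAI uses:

# Sketch only: request JSON output explicitly and validate the fields the
# prompt asks for. Field names mirror the prompt above and are assumptions.
import json

from openai import AsyncOpenAI
from pydantic import BaseModel

class InvoiceFields(BaseModel):
    invoice_number: str
    invoice_date: str
    total_amount: float
    vendor_name: str

client = AsyncOpenAI()

async def extract_fields(prompt: str) -> InvoiceFields:
    response = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # forces syntactically valid JSON
    )
    return InvoiceFields.model_validate(json.loads(response.choices[0].message.content))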

Monitoring Workflow Performance #

View workflow metrics:

  1. LLM Observability → Applications → bonsai-invoice
  2. View workflows:
    • invoice_extraction - Main extraction workflow
    • vendor_matching - Vendor identification
    • line_item_extraction - Line-by-line extraction

Key metrics per workflow:

  • Success rate - % of successful completions
  • Average latency - Time to complete
  • Token usage - Tokens consumed
  • Cost - Total cost
  • Error rate - % of failures

Set up alerts:

Alert: invoice_extraction success rate < 95%
Threshold: 95%
Evaluation window: 15 minutes
Notification: PagerDuty (SEV2)
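
A success-rate monitor needs a metric to evaluate. If the workflow does not already emit one, here is a minimal sketch using the DogStatsD client from the datadog package; the metric names, tags, and the extract_invoice helper are assumptions, not existing BonsAI metrics:

# Sketch only: count workflow outcomes so a success-rate monitor has data.
# Metric names, tags, and the extract_invoice helper are assumptions.
from datadog import statsd

async def run_extraction(document_id: str):
    try:
        result = await extract_invoice(document_id)
        statsd.increment("bonsai.invoice_extraction.success", tags=["workflow:invoice_extraction"])
        return result
    except Exception:
        statsd.increment("bonsai.invoice_extraction.failure", tags=["workflow:invoice_extraction"])
        raise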

LLM Observability Best Practices #

Comprehensive Instrumentation #

DO instrument:

  • All LLM calls (OpenAI, Claude, Mistral)
  • Multi-step workflows
  • Prompt variations (A/B testing)
  • Error handling and retries

Example:

@llm_obs_workflow_aync(workflow_name="invoice_extraction")
async def extract_invoice(document_id: str):
    trace_id = current_trace_id()

    # Step 1: Document preprocessing
    with LLMObs.task(name="preprocess_document"):
        processed_doc = await preprocess(document_id)

    # Step 2: LLM extraction
    with LLMObs.llm(model_name="gpt-4-turbo", name="extract_fields"):
        result = await call_llm(processed_doc)

    # Step 3: Post-processing
    with LLMObs.task(name="validate_output"):
        validated = await validate(result)

    return validated

Meaningful Annotations #

Add context to traces:

LLMObs.annotate(
    input_data={
        "trace_id": trace_id,
        "document_id": document_id,
        "document_type": "invoice",
        "page_count": page_count
    },
    output_data={
        "extraction_confidence": 0.95,
        "fields_extracted": ["invoice_number", "date", "total"]
    },
    tags={
        "customer_id": customer_id,
        "workflow_version": "v2.1"
    }
)

Cost Monitoring #

Track costs by:

  • Customer/organization
  • Document type (invoice, receipt, contract)
  • Model provider
  • Workflow type

Set cost alerts:

Alert: Daily LLM cost > $500
Threshold: $500
Notification: Slack #eng-alerts

Error Handling #

Log errors with context:

try:
    result = await call_llm(prompt)
except Exception as e:
    LLMObs.annotate(
        output_data={
            "error": str(e),
            "error_type": type(e).__name__,
            "retry_count": retry_count
        }
    )
    raise

Common Issues & Solutions #

Issue 1: Missing LLM Traces #

Problem: LLM calls not appearing in Datadog

Causes:

  • Decorator not applied
  • Datadog API key missing
  • Network connectivity issues

Solutions:

# 1. Verify decorator is applied
grep -r "@llm_obs" apps/bonsai-invoice/

# 2. Check Datadog API key
doppler secrets get DD_API_KEY --project bonsai --config prod

# 3. Test connectivity
kubectl exec -it <bonsai-invoice-pod> -- curl https://api.datadoghq.com

# 4. Check pod logs for Datadog errors
kubectl logs -l app=bonsai-invoice | grep -i datadog

Issue 2: Incomplete Trace Context #

Problem: Trace ID not linking frontend to LLM calls

Causes:

  • Trace ID not passed through RabbitMQ
  • Trace context lost in async operations

Solutions:

# Ensure trace_id is in message payload
message = {
    "document_id": document_id,
    "trace_id": trace_id,  # ← Must include
    "job_type": "invoice_extraction"
}

# Set trace context in worker
from bonsai_utils.logging import set_trace_context
set_trace_context(message["trace_id"])

Issue 3: High LLM Costs #

Problem: Unexpected spike in LLM spending

Investigation:

  1. Identify cost source

    LLM Observability → Applications → Cost breakdown
    Group by: workflow, model, customer
    
  2. Find expensive calls

    Filter: @cost:>1.00
    Sort by: Cost (descending)
    
  3. Analyze patterns

    • Large documents being processed repeatedly?
    • Wrong model for simple tasks?
    • Retry loops causing duplicate calls?

Solutions:

# 1. Implement cost limits
if estimated_cost > MAX_COST_PER_DOCUMENT:
    use_cheaper_model = True

# 2. Add caching
if cached_result := get_cached_extraction(document_hash):
    return cached_result

# 3. Optimize prompts
prompt = optimize_prompt_size(prompt, max_tokens=2000)

Issue 4: Low Extraction Accuracy #

Problem: LLM extracting incorrect data

Investigation:

  1. Review failed extractions

    Filter: @metadata.extraction_confidence:<0.8
    
  2. Compare prompts

    • View input prompts for low-confidence extractions
    • Identify common patterns in failures
  3. Test prompt variations

    • Create A/B test with improved prompts
    • Track accuracy by prompt version

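To track accuracy by prompt version, one option is to tag each call with its version and record the confidence score so traces can be grouped and compared in LLM Observability. A minimal sketch using LLMObs.annotate inside an instrumented workflow; the call_llm helper, the result.confidence attribute, and the tag names are assumptions:

# Sketch only: run inside an instrumented workflow (e.g. the hinoki decorators
# shown earlier) and tag each extraction with its prompt version and confidence.
# The call_llm helper, result.confidence, and tag names are assumptions.
from ddtrace.llmobs import LLMObs

async def extract_with_prompt(document_text: str, prompt_template: str, prompt_version: str):
    result = await call_llm(prompt_template.format(document=document_text))
    LLMObs.annotate(
        output_data={"extraction_confidence": result.confidence},
        tags={"prompt_version": prompt_version},  # e.g. "v1" vs "v2" for the A/B test
    )
    return result
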
Solutions:

# Improve prompt specificity
prompt_v1 = "Extract invoice data"  # Vague
prompt_v2 = """Extract these exact fields:
1. Invoice Number: Find text labeled "Invoice #" or "Invoice Number"
2. Date: Find text labeled "Invoice Date" or "Date"
3. Total: Find text labeled "Total" or "Amount Due"

Format as JSON."""  # Specific

# Add examples (few-shot learning)
prompt_v3 = f"""{prompt_v2}

Example input: "Invoice #12345, Date: 2025-01-15, Total: $500.00"
Example output: {{"invoice_number": "12345", "date": "2025-01-15", "total": 500.00}}

Now extract from: {document_text}"""

Performance Optimization #

Reduce Latency #

  1. Use faster models

    • gpt-3.5-turbo instead of gpt-4 for simple tasks
    • Streaming responses for real-time UX (see the streaming sketch after this list)
  2. Parallel LLM calls

    # Sequential (slow)
    vendor = await extract_vendor(doc)
    items = await extract_items(doc)
    
    # Parallel (fast)
    vendor, items = await asyncio.gather(
        extract_vendor(doc),
        extract_items(doc)
    )
    
  3. Optimize prompt size

    • Only send relevant document sections
    • Remove redundant instructions
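
A minimal sketch of the streaming option mentioned in item 1, using the OpenAI SDK's stream flag; the model choice and prompt handling are placeholders:

# Sketch only: stream tokens as they arrive instead of waiting for the full
# response, so the UI can render partial output immediately.
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_extraction(prompt: str) -> str:
    chunks: list[str] = []
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        chunks.append(delta)  # forward each delta to the frontend here
    return "".join(chunks)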

Reduce Costs #

  1. Model selection

    # Approximate cost comparison (per 1M tokens; verify current pricing)
    # gpt-4-turbo: $10 input / $30 output
    # gpt-3.5-turbo: $0.50 input / $1.50 output
    
    # Use cheaper model for 95% of tasks
    model = "gpt-3.5-turbo" if is_simple_invoice else "gpt-4-turbo"
    
  2. Caching

    # Cache extracted data (JSON-serialized so it can live in Redis)
    cache_key = f"extraction:{document_hash}"
    if cached := redis.get(cache_key):
        return json.loads(cached)

    result = await extract_with_llm(document)
    redis.setex(cache_key, 86400, json.dumps(result))  # 24h cache
    
  3. Batch processing

    # Process multiple documents in one LLM call
    prompt = f"""Extract data from these {len(invoices)} invoices:
    
    Invoice 1: {invoice_1_text}
    Invoice 2: {invoice_2_text}
    ...
    
    Return as array of JSON objects."""
    

Dashboards and Alerts #

Key Dashboards #

Create custom dashboards in Datadog:

  1. LLM Operations Dashboard

    • Total LLM calls (by service)
    • Average latency (by model)
    • Success rate (by workflow)
    • Cost trend (daily/weekly)
  2. Cost Monitoring Dashboard

    • Daily spend by model
    • Cost per customer
    • Token usage trends
    • Cost anomalies
  3. Quality Dashboard

    • Extraction accuracy
    • Confidence scores
    • Error rates by workflow
    • Retry rates

Key Alerts #

# Alert 1: High LLM latency
Alert: "LLM calls taking >30 seconds"
Condition: avg:llm.duration{workflow:invoice_extraction} > 30
Window: last_15m
Severity: WARNING
Notify: #eng-alerts

# Alert 2: Cost spike
Alert: "Daily LLM cost exceeds budget"
Condition: sum:llm.cost{service:bonsai-invoice} > 500
Window: last_24h
Severity: WARNING
Notify: #eng-alerts, finance@gotofu.com

# Alert 3: Low accuracy
Alert: "Extraction confidence below threshold"
Condition: avg:llm.confidence{workflow:invoice_extraction} < 0.85
Window: last_1h
Severity: WARNING
Notify: #eng-alerts

# Alert 4: High error rate
Alert: "LLM error rate above 5%"
Condition: sum:llm.errors{service:bonsai-invoice}.as_rate() > 0.05
Window: last_15m
Severity: CRITICAL
Notify: PagerDuty

See Also #