LLM Observability #

This runbook covers monitoring and debugging Large Language Model (LLM) operations in BonsAI using Datadog’s LLM Observability platform.

When to Use This Runbook #

  • Investigating LLM performance issues (slow responses, high costs)
  • Debugging document extraction errors
  • Analyzing prompt effectiveness
  • Monitoring token usage and costs
  • Tracing LLM calls end-to-end with trace IDs
  • Optimizing prompts and model selection

Overview #

BonsAI uses multiple LLM providers for document processing:

  • OpenAI (GPT-4, GPT-3.5) - Primary provider for invoice extraction
  • Anthropic Claude - Alternative for complex documents
  • Mistral AI - Cost-effective option for specific tasks

All LLM calls are instrumented with Datadog LLM Observability for comprehensive monitoring.
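
A minimal sketch of what this instrumentation setup can look like at worker startup, using ddtrace's LLMObs.enable. The ml_app name, region, and agentless mode shown here are illustrative assumptions, not the actual BonsAI configuration:

# Sketch only: enable Datadog LLM Observability when the worker starts.
# The ml_app name, region, and agentless mode are illustrative assumptions.
import os

from ddtrace.llmobs import LLMObs

LLMObs.enable(
    ml_app="bonsai-invoice",           # assumed application name
    api_key=os.environ["DD_API_KEY"],  # required when running agentless
    site="us3.datadoghq.com",          # match the Datadog region in use
    agentless_enabled=True,            # submit directly to Datadog, no local agent
)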

Accessing LLM Observability #

  1. Navigate to Datadog

    • URL: https://us3.datadoghq.com (US3 region)
    • Use company SSO to log in
  2. Open LLM Observability

    • Left sidebar → LLM Observability
    • Or directly: https://us3.datadoghq.com/llm
  3. Key Views

    • Traces - Individual LLM call traces
    • Applications - Service-level metrics
    • Workflows - Multi-step LLM operations
    • Costs - Token usage and cost analysis

How LLM Observability Works #

Instrumentation Architecture #

User Request (Frontend)
    ↓
X-TOFU-TRACE-ID generated
    ↓
BonsAPI receives request
    ↓
RabbitMQ message with trace_id
    ↓
bonsai-invoice worker picks up job
    ↓
Python decorator @llm_obs_workflow
    ↓
LLM call with trace_id annotation
    ↓
Datadog LLM Observability records:
  - Input/output
  - Latency
  - Token usage
  - Cost
  - Model/provider
  - Trace ID

Code Integration #

Python Decorators (Hinoki Library):

# Location: libs/python/bonsai-hinoki/hinoki/src/hinoki/utils/decorators/llm_obs/decorator.py

# Workflow-level tracking
@llm_obs_workflow_aync(workflow_name="invoice_extraction")
async def extract_invoice_data(document):
    trace_id = current_trace_id()
    # LLM operations tracked automatically
    ...

# LLM call tracking
@llm_obs_llm(model_provider="openai")
async def call_openai(messages, model):
    # Tracks input, output, tokens, cost
    ...

# Task tracking
@llm_obs_task(task_name="vendor_matching")
async def match_vendor(invoice_data):
    # Tracks sub-tasks within workflow
    ...

Trace ID Flow:

# bonsai_utils.logging provides trace context
from bonsai_utils.logging import current_trace_id

trace_id = current_trace_id()  # Gets X-TOFU-TRACE-ID from request

# Annotate LLM calls with trace ID
LLMObs.annotate(
    input_data={"trace_id": trace_id, "args": args},
    tags={"workflow_name": "invoice_extraction"}
)
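
For completeness, a sketch of the producer side of this flow: reading X-TOFU-TRACE-ID from the incoming request and carrying it in the queue message. FastAPI, aio-pika, the route, and the queue name are assumptions for illustration; BonsAPI's actual framework and publishing helpers may differ.

# Sketch only: propagate X-TOFU-TRACE-ID from the HTTP request into the
# RabbitMQ message. FastAPI and aio-pika are assumed for illustration.
import json
import uuid

import aio_pika
from fastapi import FastAPI, Header

app = FastAPI()

@app.post("/documents/{document_id}/extract")
async def enqueue_extraction(document_id: str, x_tofu_trace_id: str | None = Header(default=None)):
    trace_id = x_tofu_trace_id or str(uuid.uuid4())  # fall back if the header is missing
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    async with connection:
        channel = await connection.channel()
        await channel.default_exchange.publish(
            aio_pika.Message(body=json.dumps({
                "document_id": document_id,
                "trace_id": trace_id,  # consumed by the worker (see Issue 2 below)
                "job_type": "invoice_extraction",
            }).encode()),
            routing_key="invoice_extraction",  # assumed queue name
        )
    return {"trace_id": trace_id}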

Common Tasks #

Tracing an LLM Call by Trace ID #

Scenario: User reports incorrect invoice extraction

Step 1: Get the trace ID

From frontend or API logs:

X-TOFU-TRACE-ID: 550e8400-e29b-41d4-a716-446655440000

Step 2: Search in LLM Observability

  1. Navigate to Datadog → LLM Observability → Traces
  2. Search by trace ID:
    @trace_id:550e8400-e29b-41d4-a716-446655440000
    
  3. Or use metadata filter:
    @metadata.trace_id:550e8400-e29b-41d4-a716-446655440000
    

Step 3: Analyze the LLM trace

View detailed information:

  • Input prompt - What was sent to the LLM
  • Output response - What the LLM returned
  • Model used - (e.g., gpt-4-turbo)
  • Token usage - Input/output tokens
  • Latency - Time to first token, total time
  • Cost - Estimated cost of the call
  • Workflow context - Parent workflow and tasks

Step 4: Compare with logs

Cross-reference with application logs:

# In Datadog Logs
@trace_id:550e8400-e29b-41d4-a716-446655440000

# Shows complete flow:
1. User uploaded document
2. Document sent to processing queue
3. LLM called for extraction
4. Results stored in database
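
This cross-referencing only works when application logs carry the same trace ID. In BonsAI that is handled by bonsai_utils.logging; as a generic illustration, here is a minimal sketch with the standard logging module and a filter. The trace_id field name is the assumption that makes the @trace_id: log query match.

# Sketch only: attach the trace ID to every log record so Datadog log
# queries like @trace_id:<uuid> line up with LLM Observability traces.
import logging

class TraceIdFilter(logging.Filter):
    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id  # exposed to the formatter below
        return True

logger = logging.getLogger("bonsai-invoice")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('{"message": "%(message)s", "trace_id": "%(trace_id)s"}'))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter("550e8400-e29b-41d4-a716-446655440000"))

logger.warning("LLM extraction returned low confidence")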

Investigating Slow LLM Responses #

Find slow LLM calls:

  1. LLM Observability → Traces
  2. Filter by latency:
    @duration:>30s
    
  3. Group by:
    • Model (is one model slower?)
    • Workflow (which workflow is slow?)
    • Provider (provider-specific issues?)

Common causes:

  • Large prompts - Too much context sent to LLM
  • Model selection - Wrong model for the task (GPT-4 vs GPT-3.5)
  • Provider throttling - Rate limits hit (see the retry sketch after Resolution)
  • Network issues - Connectivity to LLM provider

Resolution:

# Optimize prompt size
# Before: Sending entire 50-page document
prompt = f"Extract data from: {full_document_text}"

# After: Send only relevant pages
prompt = f"Extract data from: {relevant_pages}"

# Use faster model for simple tasks
# Before: gpt-4-turbo for all tasks
model = "gpt-4-turbo"

# After: Use gpt-3.5 for simple extraction
model = "gpt-3.5-turbo" if is_simple else "gpt-4-turbo"
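
For the provider-throttling cause, retrying with exponential backoff usually clears transient 429s. A minimal sketch using the tenacity library with the OpenAI SDK's RateLimitError; the library choice, retry budget, and model are illustrative assumptions:

# Sketch only: back off and retry when the provider rate-limits us.
# tenacity and the retry budget below are assumptions for illustration.
from openai import AsyncOpenAI, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = AsyncOpenAI()

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=2, max=30),  # 2s, 4s, 8s, ... capped at 30s
    stop=stop_after_attempt(5),
)
async def call_with_backoff(messages: list[dict], model: str = "gpt-3.5-turbo"):
    response = await client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content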

Analyzing Token Usage and Costs #

View cost breakdown:

  1. LLM Observability → Applications
  2. Select application: bonsai-invoice
  3. View metrics:
    • Total tokens (input + output)
    • Cost per day/week/month
    • Cost by model
    • Cost by workflow

Identify expensive operations:

Filter: @metadata.workflow_name:invoice_extraction
Group by: @model_name
Sort by: Total cost (descending)

Results:
- gpt-4-turbo: $450/day (80% of cost)
- gpt-3.5-turbo: $95/day (20% of cost)

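For reference, a short sketch of how a per-call cost estimate can be derived from token counts and per-million-token prices. The PRICES table is approximate and should be checked against current provider pricing:

# Sketch only: derive a per-call cost estimate from token counts.
# Prices are USD per 1M tokens (input, output) and are approximate.
PRICES = {"gpt-4-turbo": (10.00, 30.00), "gpt-3.5-turbo": (0.50, 1.50)}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# e.g. a 6,000-token prompt with a 1,200-token reply on gpt-4-turbo:
# (6000 * 10 + 1200 * 30) / 1e6 = $0.096
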
Cost optimization strategies:

  1. Use cheaper models when possible

    # Simple extraction → gpt-3.5-turbo
    # Complex analysis → gpt-4-turbo
    
  2. Reduce prompt size

    # Only send relevant document sections
    # Remove redundant context
    
  3. Implement caching

    # Cache vendor matching results
    # Reuse extracted data when possible
    
  4. Batch operations

    # Extract multiple invoices in one call
    # Reduce per-call overhead
    

Debugging Extraction Errors #

Scenario: LLM extracts wrong data

Step 1: Find the trace

# Search by document ID
@metadata.document_id:doc-12345

# Or by workflow
@metadata.workflow_name:invoice_extraction

Step 2: Review input prompt

  1. Click on LLM trace
  2. View Input tab
  3. Check:
    • Is the prompt clear?
    • Is document text readable?
    • Are instructions specific?

Step 3: Review output

  1. View Output tab
  2. Compare with expected result
  3. Identify pattern:
    • Missing fields?
    • Incorrect format?
    • Hallucinated data?

Step 4: Test prompt improvements

# Bad prompt (vague)
"Extract invoice data"

# Good prompt (specific)
"""Extract the following fields from the invoice:
- Invoice number (format: INV-XXXXX)
- Invoice date (format: YYYY-MM-DD)
- Total amount (numeric value only)
- Vendor name (as shown on invoice)

Return as JSON with exact field names."""
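
When the prompt asks for JSON, it can also help to enforce it at the API level and validate the result. A minimal sketch using the OpenAI chat completions JSON response format plus a Pydantic v2 model; the field names mirror the prompt above and are assumptions, not the exact schema BonsAI uses:

# Sketch only: request JSON output explicitly and validate the fields the
# prompt asks for. Field names mirror the prompt above and are assumptions.
import json

from openai import AsyncOpenAI
from pydantic import BaseModel

class InvoiceFields(BaseModel):
    invoice_number: str
    invoice_date: str
    total_amount: float
    vendor_name: str

client = AsyncOpenAI()

async def extract_fields(prompt: str) -> InvoiceFields:
    response = await client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # forces syntactically valid JSON
    )
    return InvoiceFields.model_validate(json.loads(response.choices[0].message.content))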

Monitoring Workflow Performance #

View workflow metrics:

  1. LLM Observability → Applications → bonsai-invoice
  2. View workflows:
    • invoice_extraction - Main extraction workflow
    • vendor_matching - Vendor identification
    • line_item_extraction - Line-by-line extraction

Key metrics per workflow:

  • Success rate - % of successful completions
  • Average latency - Time to complete
  • Token usage - Tokens consumed
  • Cost - Total cost
  • Error rate - % of failures

Set up alerts:

Alert: invoice_extraction success rate < 95%
Threshold: 95%
Evaluation window: 15 minutes
Notification: PagerDuty (SEV2)
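
A success-rate monitor needs a metric to evaluate. If the workflow does not already emit one, here is a minimal sketch using the DogStatsD client from the datadog package; the metric names, tags, and the extract_invoice helper are assumptions, not existing BonsAI metrics:

# Sketch only: count workflow outcomes so a success-rate monitor has data.
# Metric names, tags, and the extract_invoice helper are assumptions.
from datadog import statsd

async def run_extraction(document_id: str):
    try:
        result = await extract_invoice(document_id)
        statsd.increment("bonsai.invoice_extraction.success", tags=["workflow:invoice_extraction"])
        return result
    except Exception:
        statsd.increment("bonsai.invoice_extraction.failure", tags=["workflow:invoice_extraction"])
        raise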

LLM Observability Best Practices #

Comprehensive Instrumentation #

DO instrument:

  • All LLM calls (OpenAI, Claude, Mistral)
  • Multi-step workflows
  • Prompt variations (A/B testing)
  • Error handling and retries

Example:

@llm_obs_workflow_aync(workflow_name="invoice_extraction")
async def extract_invoice(document_id: str):
    trace_id = current_trace_id()

    # Step 1: Document preprocessing
    with LLMObs.task(name="preprocess_document"):
        processed_doc = await preprocess(document_id)

    # Step 2: LLM extraction
    with LLMObs.llm(model_name="gpt-4-turbo", name="extract_fields"):
        result = await call_llm(processed_doc)

    # Step 3: Post-processing
    with LLMObs.task(name="validate_output"):
        validated = await validate(result)

    return validated

Meaningful Annotations #

Add context to traces:

LLMObs.annotate(
    input_data={
        "trace_id": trace_id,
        "document_id": document_id,
        "document_type": "invoice",
        "page_count": page_count
    },
    output_data={
        "extraction_confidence": 0.95,
        "fields_extracted": ["invoice_number", "date", "total"]
    },
    tags={
        "customer_id": customer_id,
        "workflow_version": "v2.1"
    }
)

Cost Monitoring #

Track costs by:

  • Customer/organization
  • Document type (invoice, receipt, contract)
  • Model provider
  • Workflow type

Set cost alerts:

Alert: Daily LLM cost > $500
Threshold: $500
Notification: Slack #eng-alerts

Error Handling #

Log errors with context:

try:
    result = await call_llm(prompt)
except Exception as e:
    LLMObs.annotate(
        output_data={
            "error": str(e),
            "error_type": type(e).__name__,
            "retry_count": retry_count
        }
    )
    raise

Common Issues & Solutions #

Issue 1: Missing LLM Traces #

Problem: LLM calls not appearing in Datadog

Causes:

  • Decorator not applied
  • Datadog API key missing
  • Network connectivity issues

Solutions:

# 1. Verify decorator is applied
grep -r "@llm_obs" apps/bonsai-invoice/

# 2. Check Datadog API key
doppler secrets get DD_API_KEY --project bonsai --config prod

# 3. Test connectivity
kubectl exec -it <bonsai-invoice-pod> -- curl https://api.datadoghq.com

# 4. Check pod logs for Datadog errors
kubectl logs -l app=bonsai-invoice | grep -i datadog

Issue 2: Incomplete Trace Context #

Problem: Trace ID not linking frontend to LLM calls

Causes:

  • Trace ID not passed through RabbitMQ
  • Trace context lost in async operations

Solutions:

# Ensure trace_id is in message payload
message = {
    "document_id": document_id,
    "trace_id": trace_id,  # ← Must include
    "job_type": "invoice_extraction"
}

# Set trace context in worker
from bonsai_utils.logging import set_trace_context
set_trace_context(message["trace_id"])

Issue 3: High LLM Costs #

Problem: Unexpected spike in LLM spending

Investigation:

  1. Identify cost source

    LLM Observability → Applications → Cost breakdown
    Group by: workflow, model, customer
    
  2. Find expensive calls

    Filter: @cost:>1.00
    Sort by: Cost (descending)
    
  3. Analyze patterns

    • Large documents being processed repeatedly?
    • Wrong model for simple tasks?
    • Retry loops causing duplicate calls?

Solutions:

# 1. Implement cost limits
if estimated_cost > MAX_COST_PER_DOCUMENT:
    use_cheaper_model = True

# 2. Add caching
if cached_result := get_cached_extraction(document_hash):
    return cached_result

# 3. Optimize prompts
prompt = optimize_prompt_size(prompt, max_tokens=2000)

Issue 4: Low Extraction Accuracy #

Problem: LLM extracting incorrect data

Investigation:

  1. Review failed extractions

    Filter: @metadata.extraction_confidence:<0.8
    
  2. Compare prompts

    • View input prompts for low-confidence extractions
    • Identify common patterns in failures
  3. Test prompt variations

    • Create A/B test with improved prompts
    • Track accuracy by prompt version

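To track accuracy by prompt version, one option is to tag each call with its version and record the confidence score so traces can be grouped and compared in LLM Observability. A minimal sketch using LLMObs.annotate inside an instrumented workflow; the call_llm helper, the result.confidence attribute, and the tag names are assumptions:

# Sketch only: run inside an instrumented workflow (e.g. the hinoki decorators
# shown earlier) and tag each extraction with its prompt version and confidence.
# The call_llm helper, result.confidence, and tag names are assumptions.
from ddtrace.llmobs import LLMObs

async def extract_with_prompt(document_text: str, prompt_template: str, prompt_version: str):
    result = await call_llm(prompt_template.format(document=document_text))
    LLMObs.annotate(
        output_data={"extraction_confidence": result.confidence},
        tags={"prompt_version": prompt_version},  # e.g. "v1" vs "v2" for the A/B test
    )
    return result
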
Solutions:

# Improve prompt specificity
prompt_v1 = "Extract invoice data"  # Vague
prompt_v2 = """Extract these exact fields:
1. Invoice Number: Find text labeled "Invoice #" or "Invoice Number"
2. Date: Find text labeled "Invoice Date" or "Date"
3. Total: Find text labeled "Total" or "Amount Due"

Format as JSON."""  # Specific

# Add examples (few-shot learning)
prompt_v3 = f"""{prompt_v2}

Example input: "Invoice #12345, Date: 2025-01-15, Total: $500.00"
Example output: {{"invoice_number": "12345", "date": "2025-01-15", "total": 500.00}}

Now extract from: {document_text}"""

Performance Optimization #

Reduce Latency #

  1. Use faster models

    • gpt-3.5-turbo instead of gpt-4 for simple tasks
    • Streaming responses for real-time UX (see the streaming sketch after this list)
  2. Parallel LLM calls

    # Sequential (slow)
    vendor = await extract_vendor(doc)
    items = await extract_items(doc)
    
    # Parallel (fast)
    vendor, items = await asyncio.gather(
        extract_vendor(doc),
        extract_items(doc)
    )
    
  3. Optimize prompt size

    • Only send relevant document sections
    • Remove redundant instructions
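
A minimal sketch of the streaming option mentioned in item 1, using the OpenAI SDK's stream flag; the model choice and prompt handling are placeholders:

# Sketch only: stream tokens as they arrive instead of waiting for the full
# response, so the UI can render partial output immediately.
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_extraction(prompt: str) -> str:
    chunks: list[str] = []
    stream = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        chunks.append(delta)  # forward each delta to the frontend here
    return "".join(chunks)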

Reduce Costs #

  1. Model selection

    # Approximate cost comparison (per 1M tokens; verify current pricing)
    # gpt-4-turbo: $10 input / $30 output
    # gpt-3.5-turbo: $0.50 input / $1.50 output
    
    # Use cheaper model for 95% of tasks
    model = "gpt-3.5-turbo" if is_simple_invoice else "gpt-4-turbo"
    
  2. Caching

    # Cache extracted data (JSON-serialized so it can live in Redis)
    cache_key = f"extraction:{document_hash}"
    if cached := redis.get(cache_key):
        return json.loads(cached)

    result = await extract_with_llm(document)
    redis.setex(cache_key, 86400, json.dumps(result))  # 24h cache
    
  3. Batch processing

    # Process multiple documents in one LLM call
    prompt = f"""Extract data from these {len(invoices)} invoices:
    
    Invoice 1: {invoice_1_text}
    Invoice 2: {invoice_2_text}
    ...
    
    Return as array of JSON objects."""
    

Dashboards and Alerts #

Key Dashboards #

Create custom dashboards in Datadog:

  1. LLM Operations Dashboard

    • Total LLM calls (by service)
    • Average latency (by model)
    • Success rate (by workflow)
    • Cost trend (daily/weekly)
  2. Cost Monitoring Dashboard

    • Daily spend by model
    • Cost per customer
    • Token usage trends
    • Cost anomalies
  3. Quality Dashboard

    • Extraction accuracy
    • Confidence scores
    • Error rates by workflow
    • Retry rates

Key Alerts #

# Alert 1: High LLM latency
Alert: "LLM calls taking >30 seconds"
Condition: avg:llm.duration{workflow:invoice_extraction} > 30
Window: last_15m
Severity: WARNING
Notify: #eng-alerts

# Alert 2: Cost spike
Alert: "Daily LLM cost exceeds budget"
Condition: sum:llm.cost{service:bonsai-invoice} > 500
Window: last_24h
Severity: WARNING
Notify: #eng-alerts, finance@gotofu.com

# Alert 3: Low accuracy
Alert: "Extraction confidence below threshold"
Condition: avg:llm.confidence{workflow:invoice_extraction} < 0.85
Window: last_1h
Severity: WARNING
Notify: #eng-alerts

# Alert 4: High error rate
Alert: "LLM error rate above 5%"
Condition: sum:llm.errors{service:bonsai-invoice}.as_rate() > 0.05
Window: last_15m
Severity: CRITICAL
Notify: PagerDuty

See Also #