LLM Observability & Cost Optimization

Completed

April 2026

A practical guide to observability for LLM applications using Langfuse. Covers tracing, cost optimization, privacy compliance, and production monitoring with real examples across Claude and OpenAI.

Key Features

LLM Observability — trace every LLM call, nested span, and event across Claude and OpenAI
Cost Optimization — model routing, prompt optimization, and semantic caching (30–50% cost reduction)
Monitoring & Alerting — webhook alerts on cost spikes, extensible to Slack/PagerDuty
Privacy & Compliance — PII redaction before logging for GDPR/HIPAA/SOC2

LLM Observability

Three levels of instrumentation:

Level 1 (OpenAI wrapper) — near-zero code changes: swap the OpenAI client for the Langfuse-wrapped one
Level 2 (@observe decorator) — trace any function; nested calls become child spans automatically
Level 3 (OpenTelemetry) — automatic tracing for LangChain with no decorators needed
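To illustrate how a decorator can turn nested calls into child spans (the Level 2 idea), here is a minimal standard-library sketch. It is not the Langfuse `@observe` implementation: the `traced` decorator, the `Span` class, and the `TRACES` list are hypothetical names invented for this example, and a real setup would send spans to Langfuse instead of keeping them in memory.

```python
import contextvars
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Span:
    """Hypothetical in-memory span; not a Langfuse type."""
    name: str
    duration_s: float = 0.0
    children: List["Span"] = field(default_factory=list)


# The span being recorded in the current context (None at the top level).
_current = contextvars.ContextVar("_current", default=None)
# Finished top-level spans, i.e. complete traces.
TRACES: List[Span] = []


def traced(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Record a span per call; calls made inside it become child spans."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        span = Span(name=fn.__name__)
        parent = _current.get()
        token = _current.set(span)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            span.duration_s = time.perf_counter() - start
            _current.reset(token)
            # Attach to the enclosing span, or store as a new trace root.
            if parent is None:
                TRACES.append(span)
            else:
                parent.children.append(span)
    return wrapper


@traced
def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]


@traced
def answer(query: str) -> str:
    docs = retrieve(query)  # recorded as a child span of `answer`
    return f"answer based on {len(docs)} document(s)"


answer("tracing")
root = TRACES[-1]
print(root.name, [child.name for child in root.children])
```

Using a context variable rather than a global stack keeps the parent/child bookkeeping correct when calls run in separate threads or async tasks.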

Three types of observations:

Generation — LLM API calls: completions, token counts, model costs
Span — any operation with duration: DB queries, retrieval steps, processing
Event — point-in-time occurrences: cache hits, errors, milestones
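The three observation types can be pictured as small records. The sketch below is an assumed data model for illustration only, not the Langfuse schema: the class fields, the `PRICES_PER_1M` table, and the model names in it are placeholders, and the prices are made-up numbers rather than real vendor rates.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# HYPOTHETICAL prices in USD per 1M tokens as (input, output); placeholders only.
PRICES_PER_1M: Dict[str, Tuple[float, float]] = {
    "example-small": (1.0, 5.0),
    "example-large": (3.0, 15.0),
}


@dataclass
class Generation:
    """An LLM API call: model, token counts, and the derived cost."""
    model: str
    input_tokens: int
    output_tokens: int

    def cost_usd(self) -> float:
        input_price, output_price = PRICES_PER_1M[self.model]
        total = self.input_tokens * input_price + self.output_tokens * output_price
        return total / 1_000_000


@dataclass
class Span:
    """Any operation with a duration, such as a DB query or retrieval step."""
    name: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


@dataclass
class Event:
    """A point-in-time occurrence, such as a cache hit or an error."""
    name: str
    timestamp_s: float


gen = Generation(model="example-small", input_tokens=1000, output_tokens=200)
print(f"{gen.cost_usd():.4f} USD")
```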

Cost Optimization Strategy

Smart model routing — classifies task type (simple, code, complex, creative) and routes to the cheapest right-fit model (Haiku → Sonnet → GPT-4o)
Prompt optimization — strips filler phrases and deduplicates instructions to reduce input tokens
Semantic caching — uses sentence-transformers + ChromaDB to return a cached response when a new query is more than 92% similar to a previously answered one; the cache persists across restarts
Combined, the three techniques target a 50–70% reduction in LLM spend
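The three techniques above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the project's code: the keyword rules, the task-to-tier mapping, and the filler list are invented for the example; the cache uses cosine similarity over an in-memory list with a pluggable `embed` function instead of sentence-transformers and ChromaDB; and `toy_embed` is a letter-count stand-in that is not semantic at all.

```python
import math
import re
from typing import Callable, List, Optional, Sequence, Tuple

# ASSUMPTION: tier labels (not API model IDs) and an illustrative mapping.
MODEL_BY_TASK = {"simple": "haiku", "code": "sonnet",
                 "creative": "sonnet", "complex": "gpt-4o"}


def classify_task(prompt: str) -> str:
    """Hypothetical keyword heuristic; a real router could use a classifier."""
    text = prompt.lower()
    if any(k in text for k in ("def ", "traceback", "function", "bug", "refactor")):
        return "code"
    if any(k in text for k in ("story", "poem", "slogan")):
        return "creative"
    if len(text.split()) > 150 or any(k in text for k in ("analyze", "compare")):
        return "complex"
    return "simple"


def route_model(prompt: str) -> str:
    return MODEL_BY_TASK[classify_task(prompt)]


# ASSUMPTION: an example filler list; tune it to the prompts you actually send.
FILLER = [r"\bplease\b", r"\bkindly\b", r"\bi would like you to\b"]


def optimize_prompt(prompt: str) -> str:
    """Strip filler phrases and drop repeated lines (case-insensitive)."""
    seen, kept = set(), []
    for line in prompt.splitlines():
        for pattern in FILLER:
            line = re.sub(pattern, "", line, flags=re.IGNORECASE)
        line = re.sub(r"\s+", " ", line).strip()
        if line and line.lower() not in seen:
            seen.add(line.lower())
            kept.append(line)
    return "\n".join(kept)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory cache keyed by embedding similarity (not persisted)."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Sequence[float], str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) > self.threshold:
            return best[1]
        return None


def toy_embed(text: str) -> List[float]:
    """Letter-count vector; a stand-in for a real sentence-embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts
```

A typical call order would be: optimize the prompt, check the cache, and only on a miss route to a model and store the new response with `put`.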

Monitoring & Alerting

Checks hourly cost against a configurable threshold
Fires webhook alerts (webhook.site, Slack, Discord, PagerDuty)
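A minimal sketch of the hourly check and webhook call, using only the standard library. The record format (timestamp, cost) and the function names are assumptions for illustration, and the `{"text": ...}` payload follows the shape Slack incoming webhooks accept; other targets such as Discord or PagerDuty expect different bodies.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional, Tuple


def hourly_cost(records: Iterable[Tuple[float, float]], now: float) -> float:
    """Sum the cost of (timestamp_s, cost_usd) records from the last hour."""
    cutoff = now - 3600
    return sum(cost for ts, cost in records if cutoff <= ts <= now)


def post_webhook(url: str, payload: Dict[str, str]) -> None:
    """POST a JSON payload to the webhook URL."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def check_cost(
    records: Iterable[Tuple[float, float]],
    threshold_usd: float,
    webhook_url: str,
    now: Optional[float] = None,
    send: Callable[[str, Dict[str, str]], None] = post_webhook,
) -> bool:
    """Send an alert and return True if the last hour's cost exceeds the threshold."""
    now = time.time() if now is None else now
    cost = hourly_cost(records, now)
    if cost > threshold_usd:
        message = (f"LLM cost alert: ${cost:.2f} in the last hour "
                   f"(threshold ${threshold_usd:.2f})")
        send(webhook_url, {"text": message})
        return True
    return False
```

Passing `send` as a parameter keeps the check testable without network access and makes it straightforward to swap in a sender for another alerting target.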

Privacy & Compliance

Redacts PII (email, phone, SSN, credit card, IP) from prompts and responses before they are sent to Langfuse
Recursive dictionary redaction for nested payloads
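Regex-based redaction with recursion over nested payloads could look like the sketch below. The patterns are simplified assumptions (US-style phone and SSN formats, IPv4 only) and will both miss some PII and flag some non-PII; the placeholder format `[REDACTED_<TYPE>]` is also invented for this example.

```python
import re
from typing import Any

# Order matters: more specific patterns run before the looser phone pattern.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("CREDIT_CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("PHONE", re.compile(
        r"(?<!\w)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\w)")),
]


def redact_text(text: str) -> str:
    """Replace each PII match with a typed placeholder."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text


def redact(value: Any) -> Any:
    """Recursively redact strings inside dicts, lists, and tuples."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, tuple):
        return tuple(redact(item) for item in value)
    return value  # numbers, booleans, None, and other types pass through


payload = {
    "prompt": "Mail jane.doe@example.com or call 555-123-4567",
    "meta": {"ssn": "123-45-6789", "ips": ["10.0.0.1"], "attempt": 3},
}
print(redact(payload))
```

Applying `redact` to both the input and the output of a traced call, before either is handed to the tracing client, keeps raw PII out of the logged trace.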

RAG Pipeline

End-to-end retrieval-augmented generation with full trace visibility: document indexing → embedding → semantic retrieval → Claude generation, with token counts and cost tracked per step.
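The pipeline stages and per-step accounting can be sketched with stand-ins for every external service. Everything here is an assumption for illustration: `embed` is a toy bag-of-words function rather than a real embedding model, the index is an in-memory list rather than ChromaDB, `generate` is a caller-supplied stub rather than a Claude call, tokens are counted by whitespace splitting, and the prices are placeholder numbers.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; stands in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


@dataclass
class StepRecord:
    name: str
    tokens: int
    cost_usd: float


@dataclass
class RagPipeline:
    generate: Callable[[str], str]            # stub for the LLM call
    embed_price_per_1k: float = 0.0001        # HYPOTHETICAL price
    generate_price_per_1k: float = 0.003      # HYPOTHETICAL price
    index: List[Tuple[str, Counter]] = field(default_factory=list)
    steps: List[StepRecord] = field(default_factory=list)

    def _record(self, name: str, text: str, price_per_1k: float) -> None:
        tokens = len(text.split())  # crude whitespace token count
        self.steps.append(StepRecord(name, tokens, tokens * price_per_1k / 1000))

    def add_documents(self, docs: List[str]) -> None:
        for doc in docs:
            self.index.append((doc, embed(doc)))
            self._record("index", doc, self.embed_price_per_1k)

    def ask(self, question: str, k: int = 1) -> str:
        query_vec = embed(question)
        self._record("embed_query", question, self.embed_price_per_1k)
        ranked = sorted(self.index, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        context = "\n".join(doc for doc, _ in ranked[:k])
        self._record("retrieve", context, 0.0)  # local lookup, no token cost
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        answer = self.generate(prompt)
        self._record("generate", prompt + " " + answer, self.generate_price_per_1k)
        return answer


pipeline = RagPipeline(generate=lambda prompt: "stub answer")
pipeline.add_documents(["langfuse records traces", "chromadb stores vectors"])
reply = pipeline.ask("where are traces recorded")
for step in pipeline.steps:
    print(step.name, step.tokens, f"{step.cost_usd:.6f}")
```

In a traced setup each `_record` call would correspond to a span or generation in the trace, so the per-step token and cost figures appear alongside timing.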

Langfuse, LangChain, Python, Anthropic, OpenAI, Debugging, Privacy, OpenTelemetry, ChromaDB, Cost Optimization
