LLM Observability & Cost Optimization

Completed

April 2026

A practical guide to observability for LLM applications using Langfuse. Covers tracing, cost optimization, privacy compliance, and production monitoring with real examples across Claude and OpenAI.

Key Features

  • LLM Observability — trace every LLM call, nested span, and event across Claude and OpenAI
  • Cost Optimization — model routing, prompt optimization, and semantic caching (30–50% cost reduction)
  • Monitoring & Alerting — webhook alerts on cost spikes, extensible to Slack/PagerDuty
  • Privacy & Compliance — PII redaction before logging for GDPR/HIPAA/SOC2

LLM Observability

Three levels of instrumentation:

  • Level 1 (OpenAI wrapper) — drop-in wrapper around the OpenAI client; existing call sites stay unchanged
  • Level 2 (@observe decorator) — trace any function; nested calls become child spans automatically
  • Level 3 (OpenTelemetry) — automatic tracing for LangChain with no decorators needed

Three types of observations:

  • Generation — LLM API calls: completions, token counts, model costs
  • Span — any operation with duration: DB queries, retrieval steps, processing
  • Event — point-in-time occurrences: cache hits, errors, milestones
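
The three observation types can be pictured as a small data model. The sketch below is an assumption about how they relate (a generation is treated as a span with token and model details attached), not the schema Langfuse uses. The class fields and the per-million-token prices in the usage lines are placeholders, not real rates.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    """Point-in-time occurrence, e.g. a cache hit or an error."""
    name: str
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    """Any operation with a duration, e.g. a DB query or retrieval step."""
    name: str
    start: float
    end: float

    @property
    def duration_s(self) -> float:
        return self.end - self.start


@dataclass
class Generation(Span):
    """An LLM API call: a span that also carries model and token usage."""
    model: str = "unknown"
    input_tokens: int = 0
    output_tokens: int = 0

    def cost_usd(self, input_price_per_mtok: float, output_price_per_mtok: float) -> float:
        # Prices are expressed per one million tokens.
        total = (self.input_tokens * input_price_per_mtok
                 + self.output_tokens * output_price_per_mtok)
        return total / 1_000_000


# Placeholder model name and prices, for illustration only.
gen = Generation(name="summarize", start=0.0, end=1.5,
                 model="example-model", input_tokens=1000, output_tokens=500)
cost = gen.cost_usd(input_price_per_mtok=3.0, output_price_per_mtok=15.0)
hit = Event(name="cache_hit")
```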

Cost Optimization Strategy

  • Smart model routing — classifies task type (simple, code, complex, creative) and routes to the cheapest right-fit model (Haiku → Sonnet → GPT-4o)
  • Prompt optimization — strips filler phrases and deduplicates instructions to reduce input tokens
  • Semantic caching — uses sentence-transformers + ChromaDB to return cached responses for queries with >92% similarity; persists across restarts
  • Combined approach targets 50–70% reduction in LLM spend
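
A rough sketch of two of the strategies above, routing and semantic caching, using only the standard library. Several parts are assumptions made for illustration: the keyword classifier, the mapping from task type to model tier, and the bag-of-words `embed` function, which stands in for sentence-transformers embeddings. The cache here lives in memory, whereas the project stores entries in ChromaDB so they persist across restarts.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.92  # reuse a cached answer only above this similarity

# Assumed mapping from task type to model tier; labels are not exact API model IDs.
ROUTES = {"simple": "haiku", "code": "sonnet", "creative": "sonnet", "complex": "gpt-4o"}


def classify(prompt: str) -> str:
    """Toy keyword classifier; a real router could use a small model instead."""
    text = prompt.lower()
    if re.search(r"\b(def|class|function|bug|stack trace)\b", text):
        return "code"
    if re.search(r"\b(story|poem|slogan)\b", text):
        return "creative"
    if len(text.split()) > 40 or re.search(r"\b(analyze|compare|step by step)\b", text):
        return "complex"
    return "simple"


def route(prompt: str) -> str:
    return ROUTES[classify(prompt)]


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts instead of a sentence-transformers vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str):
        """Return the closest cached response, or None below the threshold."""
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) > self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link.")
print(route("Fix the bug in this function"))
print(cache.get("how do I reset my password"))
```

Checking the cache before routing means a hit costs no model call at all, so the cheaper-model decision only matters for queries that miss.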

Monitoring & Alerting

  • Checks hourly cost against a configurable threshold
  • Fires webhook alerts (webhook.site, Slack, Discord, PagerDuty)
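
The check-then-alert loop can be sketched as below. The record shape (`timestamp`, `cost_usd`), the function names, and the message text are assumptions for illustration. The payload uses a `text` field, which Slack incoming webhooks accept; other services expect different fields, so the payload builder would need adapting per destination.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def hourly_cost(records, now):
    """Sum cost_usd over records from the hour ending at `now`."""
    cutoff = now - timedelta(hours=1)
    return sum(r["cost_usd"] for r in records if cutoff < r["timestamp"] <= now)


def build_alert(cost, threshold):
    return {"text": f"LLM cost alert: ${cost:.2f} in the last hour "
                    f"(threshold ${threshold:.2f})"}


def post_webhook(url, payload, timeout=5.0):
    """POST a JSON payload to a webhook URL and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def check_and_alert(records, threshold, now, send):
    """Call `send(payload)` if the last hour's cost exceeds the threshold."""
    cost = hourly_cost(records, now)
    if cost > threshold:
        send(build_alert(cost, threshold))
        return True
    return False


# Demo with an in-memory sender so nothing is sent over the network.
now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"timestamp": now - timedelta(minutes=10), "cost_usd": 3.0},
    {"timestamp": now - timedelta(minutes=50), "cost_usd": 4.5},
    {"timestamp": now - timedelta(hours=2), "cost_usd": 10.0},  # outside the window
]
sent = []
fired = check_and_alert(records, threshold=5.0, now=now, send=sent.append)
```

In production the `send` argument would be something like `lambda p: post_webhook(url, p)`, with `url` taken from configuration. Passing the sender in keeps the threshold logic testable without network access.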

Privacy & Compliance

  • Redacts PII (email, phone, SSN, credit card, IP) from prompts and responses before sending to Langfuse
  • Recursive dictionary redaction for nested payloads
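
A minimal sketch of regex-based redaction with recursion over nested payloads. The patterns are simplified assumptions (US-style phone and SSN formats, 13 to 16 digit card numbers, IPv4 only) and the placeholder labels are hypothetical. Real traffic will contain formats these patterns miss, so treat this as an illustration of the approach rather than a complete filter.

```python
import re

# Order matters: longer digit patterns run before shorter ones that could overlap.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
}


def redact_text(text: str) -> str:
    """Replace each PII match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def redact(value):
    """Redact strings anywhere inside nested dicts, lists, and tuples."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(redact(item) for item in value)
    return value  # numbers, None, and other types pass through unchanged


payload = {
    "user": {"email": "a@b.co", "notes": ["call 555-123-4567", 42]},
    "ip": "10.0.0.1",
}
clean = redact(payload)
print(clean)
```

Calling `redact` on both the prompt and the response before they are handed to the tracing client keeps raw PII out of the logged trace, which is the ordering the bullets above describe.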

RAG Pipeline

End-to-end retrieval-augmented generation with full trace visibility: document indexing → embedding → semantic retrieval → Claude generation, with token counts and cost tracked per step.

Langfuse, LangChain, Python, Anthropic, OpenAI, Debugging, Privacy, OpenTelemetry, ChromaDB, Cost Optimization