LLM Observability
Monitoring & debugging
The Problem: AI in production is a black box. How do you know when it's making mistakes, slowing down, or costing too much? How do you debug issues?
The Solution: Health Monitoring for Your AI
LLM observability means tracking, logging, and monitoring every aspect of your AI system once it is serving real users. A traditional web service is mostly deterministic: the same input produces the same output, and a 200 status code usually means "all good." An LLM feature is different. The model can return a fluent answer that is confidently wrong, the same prompt can cost twice as much tomorrow, and a small change to your system prompt can silently degrade quality for thousands of users. Observability is the discipline of making that black box readable — it is like the patient monitoring system in a hospital: vital signs on a screen, alarms when something crosses a threshold, and a detailed chart of everything that happened so you can reconstruct what went wrong after the fact.
How it works
In practice you instrument every model call and emit a structured event for it. Three signal types do most of the work. Metrics are cheap aggregate numbers you watch on dashboards: latency (track p50, p95 and p99, not just the average), cost per request, error rate, and tokens in/out. Logs capture the full payload of individual calls — the prompt, the response, the model name, and token counts — so you can inspect a specific bad answer. Traces stitch those logs into one timeline for a single user request, so a multi-step pipeline (retrieve context, build prompt, call the model, run a guardrail) shows up as a waterfall where you can see exactly which span ate the time. Because the model is non-deterministic, you also track quality signals that no HTTP status reveals: hallucination rate, a thumbs up/down ratio, and offline relevance scores.
Tradeoffs and a worked example
Add observability from day one — retrofitting it after an incident is painful, and teams that instrument early routinely find 30-40% of their spend is wasted on a handful of bloated prompts. The main pitfalls: logging full prompts and responses can capture user PII, so redact or hash sensitive fields; and high-cardinality dashboards get expensive, so sample verbose traces rather than storing every byte. A concrete example: your chat feature's p95 latency suddenly jumps from 1.2s to 4.5s. Without observability you are guessing. With a trace you open one slow request and see the breakdown — request 5ms, prompt build 120ms, LLM API 4,200ms, guardrail 50ms. The model call alone is the bottleneck. You check the logged token counts and find the input ballooned from 800 to 9,000 tokens after a recent change started stuffing the entire chat history into context. The fix — truncating old turns and adding prompt caching for the static system prompt — restores latency and cuts cost at the same time, all because the data made the problem visible.
Think of it like patient monitoring in a hospital:
- 1. Instrument all LLM calls: Log every request with: prompt text, response text, token counts (input/output), latency (ms), model used, and cost
- 2. Set up dashboards: Track cost/day, p95 latency, error rate, and tokens/request — visualize trends over time, not just current values
- 3. Create alerts: Alert on: cost spike > 2x daily average, p95 latency > 5s, error rate > 5%, quality score drop > 10%
- 4. Review traces for bottlenecks: Drill into slow requests — is it oversized context? Missing cache? Wrong model? Each trace tells a story
- 5. Iterate prompts based on data: Use observability data to find and fix the worst-performing prompts first — the top 10% of costly prompts usually account for 50%+ of spend
What to Monitor
- Performance: Track p50, p95, p99 latency per endpoint — p95 > 3s means 5% of users are waiting too long
- Debug This Trace: Trace example: request (5ms) -> prompt build (120ms) -> LLM API (4,200ms) -> guardrail (50ms) -> response. LLM step is 10x slower than expected — possible causes: oversized context, model congestion, or missing cache hit
- Cost: Track cost-per-conversation (not just per-request) — a multi-turn chat can cost 10-50x more than a single exchange
- Quality: Track hallucination rate, user thumbs-up/down ratio, and relevance scores — quality regression is silent without metrics
Fun Fact: Teams that add observability from day 1 typically find 30-40% cost savings opportunities within the first month just by seeing their actual usage patterns — most discover that their longest prompts are also the least effective ones.
Try It Yourself!
See what LLM monitoring looks like and understand the key metrics to track.
Frequently asked questions
What is LLM observability?
LLM observability means tracking, logging, and monitoring every aspect of an AI system in production: latency, cost, token counts, error rate, and quality signals like hallucination rate and thumbs up/down. Unlike a deterministic web service, an LLM can return a confidently wrong answer with a 200 status, so HTTP metrics are not enough. Observability makes the black box readable through dashboards, threshold alerts, and detailed traces you can replay after an incident.
What is the difference between logs, metrics, and traces for LLMs?
Metrics are cheap aggregates for dashboards: p50/p95/p99 latency, cost per request, error rate, and tokens in/out. Logs capture the full payload of a single call (prompt, response, model name, token counts) so you can inspect one bad answer. Traces stitch logs into a single timeline for one user request, turning a multi-step pipeline (retrieve context, build prompt, call model, run guardrail) into a waterfall that shows which span ate the time.
Which LLM metrics should I monitor in production?
At minimum: latency by percentile (p50, p95, p99, not the average), cost per request and per conversation, input/output token counts, and error rate. Add quality signals that no HTTP status reveals: hallucination rate, thumbs up/down ratio, and offline relevance scores. Useful alerts: cost spike over 2x the daily average, p95 latency over 5s, error rate over 5%, and a quality drop over 10%.
How does observability reduce LLM cost and latency?
Data makes the problem visible. Example: chat p95 latency jumps from 1.2s to 4.5s. You open a slow trace and see the model call alone took 4,200ms. The logged token count shows input ballooned from 800 to 9,000 tokens because a recent change started stuffing the whole chat history into context. The fix (truncate old turns plus prompt caching for the static system prompt) restores latency and cuts cost at once. Teams that instrument early often find 30-40% of spend wasted on a few bloated prompts.
Try it yourself
Interactive demo of this technique
Debug LLM response quality issue in production
Try rewriting the prompt and see if it gets better. Maybe the model isn't suitable for the task.
1. Per-request logging:
{
"request_id": "req_abc123",
"timestamp": "2024-01-15T10:30:00Z",
"user_id": "u_456",
"prompt": "...",
"response": "...",
"model": "gpt-4",
"tokens_in": 150,
"tokens_out": 200,
"latency_ms": 1200,
"temperature": 0.7,
"user_rating": null
}
2. Metrics (Prometheus/Grafana):
llm_latency_p95: < 3s (alert > 5s)llm_error_rate: < 1% (alert > 5%)llm_token_cost_hourly: track trendllm_user_satisfaction: thumbs up/down ratio
3. Alerts:
- Latency P95 > 5s → PagerDuty
- Error rate > 5% for 5 min → Slack
- Satisfaction < 70% for 1 hour → email team
LLM observability minimum: log every request (prompt + response + metrics) + dashboard (latency, error rate, satisfaction) + alerts. Without this, you're debugging blind.
Create a free account to solve challenges
4 AI-verified challenges for this lesson
This lesson is part of a structured LLM course.
My Learning Path