Lesson 4

LLM Observability

Monitoring & debugging

The Problem: AI in production is a black box. How do you know when it's making mistakes, slowing down, or costing too much? How do you debug issues?

The Solution: Health Monitoring for Your AI

LLM observability means tracking, logging, and monitoring every aspect of your AI system once it is serving real users. A traditional web service is mostly deterministic: the same input produces the same output, and a 200 status code usually means "all good." An LLM feature is different. The model can return a fluent answer that is confidently wrong, the same prompt can cost twice as much tomorrow, and a small change to your system prompt can silently degrade quality for thousands of users. Observability is the discipline of making that black box readable — it is like the patient monitoring system in a hospital: vital signs on a screen, alarms when something crosses a threshold, and a detailed chart of everything that happened so you can reconstruct what went wrong after the fact.

How it works

In practice you instrument every model call and emit a structured event for it. Three signal types do most of the work. Metrics are cheap aggregate numbers you watch on dashboards: latency (track p50, p95 and p99, not just the average), cost per request, error rate, and tokens in/out. Logs capture the full payload of individual calls — the prompt, the response, the model name, and token counts — so you can inspect a specific bad answer. Traces stitch those logs into one timeline for a single user request, so a multi-step pipeline (retrieve context, build prompt, call the model, run a guardrail) shows up as a waterfall where you can see exactly which span ate the time. Because the model is non-deterministic, you also track quality signals that no HTTP status reveals: hallucination rate, a thumbs up/down ratio, and offline relevance scores.

Tradeoffs and a worked example

Add observability from day one — retrofitting it after an incident is painful, and teams that instrument early routinely find 30-40% of their spend is wasted on a handful of bloated prompts. The main pitfalls: logging full prompts and responses can capture user PII, so redact or hash sensitive fields; and high-cardinality dashboards get expensive, so sample verbose traces rather than storing every byte. A concrete example: your chat feature's p95 latency suddenly jumps from 1.2s to 4.5s. Without observability you are guessing. With a trace you open one slow request and see the breakdown — request 5ms, prompt build 120ms, LLM API 4,200ms, guardrail 50ms. The model call alone is the bottleneck. You check the logged token counts and find the input ballooned from 800 to 9,000 tokens after a recent change started stuffing the entire chat history into context. The fix — truncating old turns and adding prompt caching for the static system prompt — restores latency and cuts cost at the same time, all because the data made the problem visible.

Think of it like patient monitoring in a hospital:

1. Instrument all LLM calls: Log every request with: prompt text, response text, token counts (input/output), latency (ms), model used, and cost
2. Set up dashboards: Track cost/day, p95 latency, error rate, and tokens/request — visualize trends over time, not just current values
3. Create alerts: Alert on: cost spike > 2x daily average, p95 latency > 5s, error rate > 5%, quality score drop > 10%
4. Review traces for bottlenecks: Drill into slow requests — is it oversized context? Missing cache? Wrong model? Each trace tells a story
5. Iterate prompts based on data: Use observability data to find and fix the worst-performing prompts first — the top 10% of costly prompts usually account for 50%+ of spend

What to Monitor

Performance: Track p50, p95, p99 latency per endpoint — p95 > 3s means 5% of users are waiting too long
Debug This Trace: Trace example: request (5ms) -> prompt build (120ms) -> LLM API (4,200ms) -> guardrail (50ms) -> response. LLM step is 10x slower than expected — possible causes: oversized context, model congestion, or missing cache hit
Cost: Track cost-per-conversation (not just per-request) — a multi-turn chat can cost 10-50x more than a single exchange
Quality: Track hallucination rate, user thumbs-up/down ratio, and relevance scores — quality regression is silent without metrics

Fun Fact: Teams that add observability from day 1 typically find 30-40% cost savings opportunities within the first month just by seeing their actual usage patterns — most discover that their longest prompts are also the least effective ones.

Try It Yourself!

See what LLM monitoring looks like and understand the key metrics to track.

Frequently asked questions

What is LLM observability?

LLM observability means tracking, logging, and monitoring every aspect of an AI system in production: latency, cost, token counts, error rate, and quality signals like hallucination rate and thumbs up/down. Unlike a deterministic web service, an LLM can return a confidently wrong answer with a 200 status, so HTTP metrics are not enough. Observability makes the black box readable through dashboards, threshold alerts, and detailed traces you can replay after an incident.

What is the difference between logs, metrics, and traces for LLMs?

Metrics are cheap aggregates for dashboards: p50/p95/p99 latency, cost per request, error rate, and tokens in/out. Logs capture the full payload of a single call (prompt, response, model name, token counts) so you can inspect one bad answer. Traces stitch logs into a single timeline for one user request, turning a multi-step pipeline (retrieve context, build prompt, call model, run guardrail) into a waterfall that shows which span ate the time.

Which LLM metrics should I monitor in production?

At minimum: latency by percentile (p50, p95, p99, not the average), cost per request and per conversation, input/output token counts, and error rate. Add quality signals that no HTTP status reveals: hallucination rate, thumbs up/down ratio, and offline relevance scores. Useful alerts: cost spike over 2x the daily average, p95 latency over 5s, error rate over 5%, and a quality drop over 10%.

How does observability reduce LLM cost and latency?

Data makes the problem visible. Example: chat p95 latency jumps from 1.2s to 4.5s. You open a slow trace and see the model call alone took 4,200ms. The logged token count shows input ballooned from 800 to 9,000 tokens because a recent change started stuffing the whole chat history into context. The fix (truncate old turns plus prompt caching for the static system prompt) restores latency and cuts cost at once. Teams that instrument early often find 30-40% of spend wasted on a few bloated prompts.

Try it yourself

Interactive demo of this technique

Technique Comparison

Demo Mode

Pre-recorded responses

TaskBeginnerAnalysis

Debug LLM response quality issue in production

Without technique

Prompt

Users complain that the bot answers incorrectly. No logs. How to find the problem?

Response

Try rewriting the prompt and see if it gets better. Maybe the model isn't suitable for the task.

Tokens:25/30

Time:350ms

Quality:

With Observability

With technique

Prompt

You are an SRE engineer specializing in LLM systems. Problem: users complain about incorrect bot responses. No logs. Design a minimal observability system for the LLM: 1. What to log per request? 2. What metrics to collect? 3. What alerts to set up? Format: specific fields/metrics with example values.

Response

1. Per-request logging:

{
  "request_id": "req_abc123",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "u_456",
  "prompt": "...",
  "response": "...",
  "model": "gpt-4",
  "tokens_in": 150,
  "tokens_out": 200,
  "latency_ms": 1200,
  "temperature": 0.7,
  "user_rating": null
}

2. Metrics (Prometheus/Grafana):

llm_latency_p95: < 3s (alert > 5s)
llm_error_rate: < 1% (alert > 5%)
llm_token_cost_hourly: track trend
llm_user_satisfaction: thumbs up/down ratio

3. Alerts:

Latency P95 > 5s → PagerDuty
Error rate > 5% for 5 min → Slack
Satisfaction < 70% for 1 hour → email team

👁️Without logs, impossible to tell: is it a bad prompt, wrong temperature, or hallucinations?

🧠Minimum set: request_id + prompt + response + metrics → reproducibility of any issue

✅Alerts on latency + error rate + satisfaction catch issues BEFORE mass complaints

Tokens:95/250

Time:1500ms

Quality:

Why this works

LLM observability minimum: log every request (prompt + response + metrics) + dashboard (latency, error rate, satisfaction) + alerts. Without this, you're debugging blind.

1 / 2

Practice Challenges

Create a free account to solve challenges

4 AI-verified challenges for this lesson

Related lessons:Cost Optimization Guardrails

This lesson is part of a structured LLM course.

My Learning Path

Lesson 4

LLM Observability

Monitoring & debugging

The Problem: AI in production is a black box. How do you know when it's making mistakes, slowing down, or costing too much? How do you debug issues?

The Solution: Health Monitoring for Your AI

How it works

Tradeoffs and a worked example

Think of it like patient monitoring in a hospital:

1. Instrument all LLM calls: Log every request with: prompt text, response text, token counts (input/output), latency (ms), model used, and cost
2. Set up dashboards: Track cost/day, p95 latency, error rate, and tokens/request — visualize trends over time, not just current values
3. Create alerts: Alert on: cost spike > 2x daily average, p95 latency > 5s, error rate > 5%, quality score drop > 10%
4. Review traces for bottlenecks: Drill into slow requests — is it oversized context? Missing cache? Wrong model? Each trace tells a story
5. Iterate prompts based on data: Use observability data to find and fix the worst-performing prompts first — the top 10% of costly prompts usually account for 50%+ of spend

What to Monitor

Performance: Track p50, p95, p99 latency per endpoint — p95 > 3s means 5% of users are waiting too long
Debug This Trace: Trace example: request (5ms) -> prompt build (120ms) -> LLM API (4,200ms) -> guardrail (50ms) -> response. LLM step is 10x slower than expected — possible causes: oversized context, model congestion, or missing cache hit
Cost: Track cost-per-conversation (not just per-request) — a multi-turn chat can cost 10-50x more than a single exchange
Quality: Track hallucination rate, user thumbs-up/down ratio, and relevance scores — quality regression is silent without metrics

Try It Yourself!

See what LLM monitoring looks like and understand the key metrics to track.

Frequently asked questions

What is LLM observability?

What is the difference between logs, metrics, and traces for LLMs?

Which LLM metrics should I monitor in production?

How does observability reduce LLM cost and latency?

Try it yourself

Interactive demo of this technique

Technique Comparison

Demo Mode

Pre-recorded responses

TaskBeginnerAnalysis

Debug LLM response quality issue in production

Without technique

Prompt

Users complain that the bot answers incorrectly. No logs. How to find the problem?

Response

Try rewriting the prompt and see if it gets better. Maybe the model isn't suitable for the task.

Tokens:25/30

Time:350ms

Quality:

With Observability

With technique

Prompt

Response

1. Per-request logging:

{
  "request_id": "req_abc123",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "u_456",
  "prompt": "...",
  "response": "...",
  "model": "gpt-4",
  "tokens_in": 150,
  "tokens_out": 200,
  "latency_ms": 1200,
  "temperature": 0.7,
  "user_rating": null
}

2. Metrics (Prometheus/Grafana):

llm_latency_p95: < 3s (alert > 5s)
llm_error_rate: < 1% (alert > 5%)
llm_token_cost_hourly: track trend
llm_user_satisfaction: thumbs up/down ratio

3. Alerts:

Latency P95 > 5s → PagerDuty
Error rate > 5% for 5 min → Slack
Satisfaction < 70% for 1 hour → email team

👁️Without logs, impossible to tell: is it a bad prompt, wrong temperature, or hallucinations?

🧠Minimum set: request_id + prompt + response + metrics → reproducibility of any issue

✅Alerts on latency + error rate + satisfaction catch issues BEFORE mass complaints

Tokens:95/250

Time:1500ms

Quality:

Why this works

LLM observability minimum: log every request (prompt + response + metrics) + dashboard (latency, error rate, satisfaction) + alerts. Without this, you're debugging blind.

1 / 2

Practice Challenges

Create a free account to solve challenges

4 AI-verified challenges for this lesson

Related lessons:Cost Optimization Guardrails

This lesson is part of a structured LLM course.

My Learning Path