Lesson 7

LLM Deployment

FastAPI, Docker, K8s

The Problem: You've built an AI feature. Now how do you deploy it to production safely? How do you handle updates, rollbacks, and scaling?

The Solution: Launch Carefully Like a Rocket

LLM deployment is the work of moving an AI feature from your laptop into a service that real users hit — safely, repeatably, and with a way back if something breaks. The model itself is only part of it. A production deployment is a whole system: an inference endpoint (a hosted API like Anthropic or OpenAI, or your own server running an open model), the application code that builds prompts and parses responses, plus rate limiting, retries, caching, monitoring, and a rollback plan. It's like launching a rocket — the engine matters, but so do the launch checklist, the staged ignition, and the telemetry you watch the whole way up.

How it works

Most teams start by calling a managed API: you send a request, you get tokens back, and the provider handles GPUs and scaling. That's the fastest path to production, but you pay per token and depend on someone else's uptime. If you self-host an open model (Llama, Mistral, Qwen) with a server like vLLM, TGI, or Ollama, you control cost and data residency, but you now own GPU provisioning, batching, and scaling. LLM calls are slow and can fail in many ways (timeouts, rate limits, content filters, truncated or malformed JSON), and each of those failure modes needs its own handler. Two more practices matter in production: quantization, which stores a self-hosted model's weights at lower precision so it fits on cheaper hardware, and observability, which means logging latency, error rates, and token usage so you can see what production is doing instead of guessing.
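One way to give each failure mode its own handler is to classify every call outcome first and decide on retries second. The sketch below is a minimal illustration, not any provider's API: the `outcome` shape (`timedOut`, `status`, `finishReason`, `body`) and both function names are hypothetical, the `finishReason` values follow the names OpenAI's chat API reports (`content_filter`, `length`), and it assumes the application expects JSON in the response body.

```javascript
// Map one LLM call outcome to a failure kind and a retry decision.
// The outcome shape is an assumption made for this sketch.
function classifyOutcome(outcome) {
  const { timedOut, status, finishReason, body } = outcome;
  if (timedOut) return { kind: "timeout", retry: true };
  if (status === 429) return { kind: "rate_limit", retry: true };
  if (status >= 500) return { kind: "provider_error", retry: true };
  if (finishReason === "content_filter") return { kind: "content_filter", retry: false };
  if (finishReason === "length") return { kind: "truncated", retry: false };
  try {
    JSON.parse(body); // the app is assumed to expect JSON output
  } catch {
    return { kind: "malformed_json", retry: true };
  }
  return { kind: "ok", retry: false };
}

// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at 8 s.
function backoffMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

Keeping classification separate from the retry loop makes it easy to log the failure kind as a metric, which feeds the dashboards described above.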

Tradeoffs and a worked example

The core tension is speed of shipping vs. control of risk. Shipping straight to 100% of users is fast but turns every bug into an incident; a staged rollout is slower but contains the blast radius. Say you're replacing your support bot's model with a newer one. A safe deploy looks like this: first run it as a shadow — the new model answers in the background, you log its replies but never show them to users, and compare quality offline. Then do a canary: route 5% of live traffic to the new model, watch error rate and latency for an hour, and only widen to 25%, 50%, 100% if the dashboards stay green. Keep the old version warm so a single config change rolls you back in under five minutes. The lesson: never let a deploy be a one-way door — every release should be observable while it ramps and reversible if it misbehaves.
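The canary step above needs a rule for deciding which users see the new model. A common approach is to hash the user ID into a stable bucket, so each user gets a consistent experience while the percentage ramps. This is a sketch under assumptions: `bucketFor` and `pickVersion` are hypothetical names, and in practice the split often lives in a load balancer or feature-flag service instead of application code.

```javascript
const crypto = require("crypto");

// Map a user ID to a stable bucket in [0, 100).
function bucketFor(userId) {
  const digest = crypto.createHash("sha256").update(String(userId)).digest();
  return digest.readUInt32BE(0) % 100;
}

// Users whose bucket is below the canary percentage get the new model.
function pickVersion(userId, canaryPercent) {
  return bucketFor(userId) < canaryPercent ? "canary" : "stable";
}
```

Because the bucket never changes, a user who was in the canary at 5% stays in it at 25%, and setting the percentage to 0 sends everyone back to the old version.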

Think of it like a rocket launch:

1. Rate limiting configured: Per-user and per-IP limits prevent abuse and runaway costs
2. API keys secured: Not in client code, stored in environment variables, rotated regularly
3. Error handling for all LLM failure modes: Timeout, rate limit, content filter, malformed response — each has a dedicated handler
4. Monitoring and alerting live: Dashboards track latency, error rates, token usage; alerts fire on anomalies
5. Fallback behavior defined: When LLM is down: cached responses, simplified non-LLM answers, or a friendly "try again" message
6. Load testing passed: System tested at 2x expected peak traffic — no crashes, acceptable latency
7. Rollback strategy defined: Canary: set the new version's traffic share back to 0%. Blue-green: switch back to the previous environment. Self-hosted: revert to the previous model checkpoint. Every deploy must be reversible in under 5 minutes.
8. A/B evaluation pipeline: Compare old vs. new model outputs on real traffic. Track quality metrics (accuracy, relevance scores) alongside latency and cost before full rollout.

Production checklist: 1 bug in staging is cheaper than 1,000 bugs in production. Test every failure mode before launch.
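To make item 1 of the checklist concrete, here is a minimal sketch of a per-key fixed-window rate limiter. `createRateLimiter` is a hypothetical helper written for this lesson, not a library API. It keeps its counters in process memory, so it only works for a single server instance; a multi-instance deployment would need a shared store. The `now` parameter exists so the clock can be replaced in tests.

```javascript
// Fixed-window limiter: at most `limit` requests per `windowMs` per key.
// The key can be a user ID or an IP address.
function createRateLimiter(limit, windowMs, now = Date.now) {
  const windows = new Map(); // key -> { start, count }
  return function allow(key) {
    const t = now();
    let w = windows.get(key);
    if (!w || t - w.start >= windowMs) {
      // First request from this key, or the previous window has expired
      w = { start: t, count: 0 };
      windows.set(key, w);
    }
    if (w.count >= limit) return false; // over the limit: reject
    w.count += 1;
    return true;
  };
}

// Usage: 10 requests per minute per user
const allow = createRateLimiter(10, 60 * 1000);
console.log(allow("user-42")); // true for the first 10 calls in a minute
```

A production version would also evict idle keys so the map does not grow without bound.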

Deployment Options

API-based: Use OpenAI, Anthropic, etc. — easiest
Self-hosted: Run open models on your infrastructure
Hybrid: API for complex tasks, self-hosted for simple ones
Edge: Small models running on user devices
Graceful Degradation: When the LLM is slow or down, show cached responses, simplified non-LLM answers, or a friendly "try again" message
Canary Deployment: Route 5-10% of traffic to the new version first. Monitor error rates and latency. If stable, gradually increase to 100%.
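Blue-Green Deployment: Run two identical environments: "blue" (current) and "green" (new). Switch all traffic with one load balancer or DNS change, and roll back by switching back. A load balancer switch takes effect immediately; a DNS switch takes effect only as cached records expire.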
Self-hosted LLMs: Deploy open-source models (Llama, Mistral) via vLLM, TGI, or Ollama. GPU provisioning, quantization (GPTQ/AWQ), and auto-scaling are key challenges.
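To see why quantization matters for GPU provisioning, it helps to estimate how much memory the weights alone need: parameter count times bits per weight, divided by 8 to get bytes. The helper below is a back-of-the-envelope sketch with a hypothetical name. It ignores the KV cache, activations, and runtime overhead, and real GPTQ or AWQ files are somewhat larger than this estimate because they also store scaling data.

```javascript
// Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).
// paramsBillions: parameter count in billions; bitsPerWeight: 16, 8, 4, ...
function weightMemoryGB(paramsBillions, bitsPerWeight) {
  return (paramsBillions * bitsPerWeight) / 8;
}

console.log(weightMemoryGB(7, 16)); // 14   (7B model, 16-bit weights)
console.log(weightMemoryGB(7, 4));  // 3.5  (same model, 4-bit quantized)
console.log(weightMemoryGB(70, 4)); // 35   (70B model, 4-bit quantized)
```

By this estimate, a 7B model drops from about 14 GB to about 3.5 GB of weights at 4 bits, which is the difference between needing a data-center GPU and fitting on a consumer card.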

Fun Fact: Many companies use "shadow deployment" first — running the new AI alongside the old system without showing results to users. This lets you compare outputs and catch issues before real deployment.

Try It Yourself!

Use the deployment checklist to ensure your AI application is ready for production.

Frequently asked questions

What is the difference between canary and blue-green deployment?

Canary gradually shifts traffic to the new version (e.g. 5% → 25% → 100%), watching error rate and latency at each step, so a bug only hits a fraction of users. Blue-green runs two identical environments and switches all traffic at once with a single load-balancer or DNS change; rollback means switching back. A load-balancer switch takes effect immediately, while a DNS switch only takes effect as cached records expire. Canary gives finer-grained risk control; blue-green is faster to switch and revert.
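The blue-green idea reduces to a single pointer that names the active environment. The class below is an in-process stand-in for what a load balancer does, written only to illustrate the mechanism; `BlueGreenRouter` and the internal URLs are hypothetical.

```javascript
// Hypothetical stand-in for a load balancer's active target.
class BlueGreenRouter {
  constructor(blueUrl, greenUrl) {
    this.targets = { blue: blueUrl, green: greenUrl };
    this.active = "blue"; // blue serves all traffic until the cutover
  }

  // Every request goes to whichever environment is active.
  route() {
    return this.targets[this.active];
  }

  // Cut over, or roll back, by flipping the single pointer.
  flip() {
    this.active = this.active === "blue" ? "green" : "blue";
    return this.active;
  }
}

const router = new BlueGreenRouter("http://blue.internal", "http://green.internal");
router.flip();               // deploy: all traffic now goes to green
console.log(router.route()); // http://green.internal
router.flip();               // rollback: all traffic returns to blue
```

Unlike a canary, there is no intermediate state: the new version receives either none of the traffic or all of it.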

Should I deploy an LLM via API or self-hosted?

A managed API (Anthropic, OpenAI) is the fastest path to production: the provider handles GPUs and scaling, but you pay per token and depend on their uptime. Self-hosting an open model (Llama, Mistral, Qwen) with vLLM, TGI, or Ollama gives you control over cost and data residency, but you own GPU provisioning, quantization, batching, and auto-scaling. Many teams use a hybrid: API for complex tasks, self-hosted for simple high-volume ones.

What is shadow deployment?

Shadow deployment runs the new model alongside the old one, answering real requests in the background, but its outputs are only logged and never shown to users. This lets you compare the new model's quality against the old one on live traffic and catch problems before they reach anyone. It is usually the first step before a canary rollout.
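Once shadow traffic has been logged, the comparison itself can be a small offline script. The sketch below assumes a hypothetical log format in which each entry holds both replies, both latencies, and a flag for whether the new model failed. Exact string match is used only as a crude stand-in for a quality metric; a real evaluation would score replies with a rubric or a grader.

```javascript
// Summarize a shadow log. Each entry is assumed to look like:
// { stableReply, candidateReply, candidateError, stableMs, candidateMs }
function summarizeShadowLog(entries) {
  const ok = entries.filter((e) => !e.candidateError);
  const matches = ok.filter((e) => e.candidateReply === e.stableReply).length;
  const extraMs = ok.reduce((sum, e) => sum + (e.candidateMs - e.stableMs), 0);
  return {
    total: entries.length,
    // Share of requests where the new model failed outright
    candidateErrorRate: entries.length ? (entries.length - ok.length) / entries.length : 0,
    // Share of successful replies identical to the old model's reply
    exactMatchRate: ok.length ? matches / ok.length : 0,
    // Average added latency of the new model on successful requests
    meanExtraLatencyMs: ok.length ? extraMs / ok.length : 0,
  };
}
```

Numbers like these are what justify moving from shadow to a first canary step, or stopping before any user sees the new model.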

How do I safely roll back an LLM deployment?

Every deploy should be reversible in under five minutes. For canary, set the new version's traffic share back to 0%. For blue-green, switch the load balancer back to the old environment. For a self-hosted model, revert to the previous checkpoint. The key rule: keep the old version warm and never make a deploy a one-way door.
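For a canary, the widen-or-roll-back decision can be written as one function that runs after each observation window. This is a sketch under assumptions: the stages are the 5%, 25%, 50%, 100% steps from the worked example in this lesson, while the function name, the metric names, and the thresholds in the usage lines are illustrative, not recommendations.

```javascript
const STAGES = [5, 25, 50, 100]; // canary percentages, in ramp order

// Decide the next canary percentage from the latest metrics.
function nextCanaryPercent(current, metrics, limits) {
  const healthy =
    metrics.errorRate <= limits.maxErrorRate &&
    metrics.p95LatencyMs <= limits.maxP95LatencyMs;
  if (!healthy) return 0; // roll back: all traffic returns to the old version
  const next = STAGES.find((stage) => stage > current);
  return next === undefined ? 100 : next; // already at the last stage
}

const limits = { maxErrorRate: 0.02, maxP95LatencyMs: 3000 };
console.log(nextCanaryPercent(5, { errorRate: 0.01, p95LatencyMs: 1800 }, limits));  // 25
console.log(nextCanaryPercent(25, { errorRate: 0.08, p95LatencyMs: 1800 }, limits)); // 0
```

Returning 0 on any unhealthy reading keeps the rollback path as simple as the ramp itself, which is what makes a five-minute reversal realistic.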

Try it yourself


Safely deploy an LLM app: proxy instead of direct API key in client

Without technique

Prompt

How to connect OpenAI API to my React app?

Response

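// React component (anti-pattern: the key ships to every browser)
const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer sk-abc123..."  // visible to anyone in DevTools
  },
  body: JSON.stringify({ model: "gpt-4", messages })
});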


With technique

Prompt

You are a senior backend developer. Design a secure proxy for the LLM API.
Requirements:
- API key only on server (never in client)
- Rate limiting: 10 requests/min per user
- Input validation
- Logging for cost monitoring
Show architecture: client → proxy → LLM API. Pseudocode for proxy on Node.js/Express.

Response

Architecture: React → /api/chat (your server) → OpenAI API

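// server.js (Express proxy)
const express = require("express");
const rateLimit = require("express-rate-limit");
const OpenAI = require("openai");

const app = express();
app.use(express.json()); // parse JSON request bodies

// The client reads the key from process.env.OPENAI_API_KEY; it never reaches the browser
const openai = new OpenAI();

// Assumes an auth middleware (not shown) has already set req.user
const limiter = rateLimit({
  windowMs: 60 * 1000,   // 1 minute
  max: 10,               // 10 requests per minute per user
  keyGenerator: (req) => req.user.id
});

app.post("/api/chat", limiter, async (req, res) => {
  // 1. Validation
  const { message } = req.body;
  if (typeof message !== "string" || !message || message.length > 4000) {
    return res.status(400).json({ error: "Invalid input" });
  }

  try {
    // 2. LLM call (key stays on the server)
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: message }]
    });

    // 3. Logging for cost monitoring
    console.log(JSON.stringify({ user: req.user.id, tokens: response.usage }));

    res.json({ reply: response.choices[0].message.content });
  } catch (err) {
    // 4. Log the provider error on the server; return a generic message to the client
    console.error(err);
    res.status(502).json({ error: "Upstream LLM request failed" });
  }
});

app.listen(3000); // port chosen for illustration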

Never: put an sk-... key in client code, commit .env to git, or allow CORS without restrictions.

👁️ API key in client = leak. Anyone can open DevTools and copy the key.

🧠 Proxy server solves it: key on backend + rate limit + validation + logging.

🔍 Rate limiting protects from abuse: without it, one user can exhaust the entire budget.

✅ Production pattern: client → your proxy (auth + rate limit + logging) → LLM API.


Why this works

Never put the API key in client code. Production pattern: client → proxy on your server (auth, rate limit, validation, logging) → LLM API. This protects both the key and the budget.

