LLM Deployment
FastAPI, Docker, K8s
The Problem: You've built an AI feature. Now how do you deploy it to production safely? How do you handle updates, rollbacks, and scaling?
The Solution: Launch Carefully Like a Rocket
LLM deployment is the work of moving an AI feature from your laptop into a service that real users hit — safely, repeatably, and with a way back if something breaks. The model itself is only part of it. A production deployment is a whole system: an inference endpoint (a hosted API like Anthropic or OpenAI, or your own server running an open model), the application code that builds prompts and parses responses, plus rate limiting, retries, caching, monitoring, and a rollback plan. It's like launching a rocket — the engine matters, but so do the launch checklist, the staged ignition, and the telemetry you watch the whole way up.
How it works
Most teams start by calling a managed API: you send a request, you get tokens back, and the provider handles GPUs and scaling. That's the fastest path to production, but you pay per token and depend on someone else's uptime. If you self-host an open model (Llama, Mistral, Qwen) with a server like vLLM, TGI, or Ollama, you control cost and data residency, but you now own GPU provisioning, batching, and scaling. Because LLM calls are latency-heavy and can fail in many ways — timeouts, rate limits, content filters, truncated or malformed JSON — every one of those failure modes needs its own handler. Two techniques make this manageable: quantization, which shrinks a self-hosted model so it fits on cheaper hardware, and observability, which means logging latency, error rates, and token usage so you can actually see what production is doing instead of guessing.
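To make "each failure mode needs its own handler" concrete, here is a minimal sketch of a failure classifier. The error shape (`status`, `code`, `name`, `finishReason`) and the function names are assumptions made for illustration, not any specific SDK's API; map your provider's real error fields onto it.

```javascript
// Sketch only: the error shape ({ status, code, name, finishReason }) is an
// assumption for illustration, not a specific SDK's error type.
function classifyFailure(err) {
  if (err.name === "AbortError" || err.code === "ETIMEDOUT") {
    return { kind: "timeout", retry: true };
  }
  if (err.status === 429) {
    return { kind: "rate_limit", retry: true };
  }
  if (err.finishReason === "content_filter") {
    // Retrying the same prompt will not help; show a fallback message instead.
    return { kind: "content_filter", retry: false };
  }
  if (err.finishReason === "length") {
    // Output was cut off; the request needs a larger token budget or a shorter prompt.
    return { kind: "truncated", retry: false };
  }
  if (err.status >= 500) {
    return { kind: "server_error", retry: true };
  }
  return { kind: "unknown", retry: false };
}

// Parse model output that is supposed to be JSON without letting a bad
// response throw through the request handler.
function parseModelJson(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, failure: { kind: "malformed_json", retry: true } };
  }
}

// Exponential backoff with a cap: 500 ms, 1 s, 2 s, ... up to capMs.
function backoffMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

In a real retry loop you would likely add random jitter to the delay so that many clients do not retry at the same instant; it is left out here to keep the sketch deterministic.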
Tradeoffs and a worked example
The core tension is speed of shipping vs. control of risk. Shipping straight to 100% of users is fast but turns every bug into an incident; a staged rollout is slower but contains the blast radius. Say you're replacing your support bot's model with a newer one. A safe deploy looks like this: first run it as a shadow — the new model answers in the background, you log its replies but never show them to users, and compare quality offline. Then do a canary: route 5% of live traffic to the new model, watch error rate and latency for an hour, and only widen to 25%, 50%, 100% if the dashboards stay green. Keep the old version warm so a single config change rolls you back in under five minutes. The lesson: never let a deploy be a one-way door — every release should be observable while it ramps and reversible if it misbehaves.
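The canary step described above can be sketched as two small functions: one that assigns each user to a stable bucket, and one that decides whether to widen the rollout. The ramp steps come from the example in the text; the health thresholds (`maxErrorRate`, `maxP95Ms`) and all function names are hypothetical placeholders you would tune for your own service.

```javascript
// FNV-1a 32-bit hash: a simple, dependency-free way to get a stable number
// from a user ID so the same user always lands in the same bucket.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Route a user to the "new" or "old" model given the current canary percentage.
// Because the bucket is fixed per user, widening the percentage only ever adds
// users to the new model; nobody flips back and forth between versions.
function routeModel(userId, canaryPercent) {
  const bucket = fnv1a(String(userId)) % 100; // 0..99
  return bucket < canaryPercent ? "new" : "old";
}

const RAMP = [5, 25, 50, 100];

// Decide the next canary percentage from observed metrics.
// Threshold defaults are illustrative assumptions, not recommendations.
function nextCanaryPercent(current, metrics, limits = { maxErrorRate: 0.02, maxP95Ms: 3000 }) {
  const unhealthy =
    metrics.errorRate > limits.maxErrorRate || metrics.p95Ms > limits.maxP95Ms;
  if (unhealthy) return 0; // roll back: all traffic returns to the old model
  const next = RAMP.find((step) => step > current);
  return next === undefined ? 100 : next;
}
```

Rolling back is then a single config change: set the canary percentage to 0 and every request routes to the old model again.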
Think of it like a rocket launch. The pre-launch checklist:
1. Rate limiting configured: Per-user and per-IP limits prevent abuse and runaway costs
2. API keys secured: Not in client code, stored in environment variables, rotated regularly
3. Error handling for all LLM failure modes: Timeout, rate limit, content filter, and malformed response each have a dedicated handler
4. Monitoring and alerting live: Dashboards track latency, error rates, and token usage; alerts fire on anomalies
5. Fallback behavior defined: When the LLM is down, serve cached responses, simplified non-LLM answers, or a friendly "try again" message
6. Load testing passed: System tested at 2x expected peak traffic with no crashes and acceptable latency
7. Rollback strategy defined: For canary, stop the traffic shift. For blue-green, switch back. For self-hosted, revert to the previous model checkpoint. Every deploy must be reversible in under 5 minutes.
8. A/B evaluation pipeline: Compare old vs. new model outputs on real traffic. Track quality metrics (accuracy, relevance scores) alongside latency and cost before full rollout.
Production checklist: 1 bug in staging is cheaper than 1,000 bugs in production. Test every failure mode before launch.
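The fallback item on the checklist is the one most often left undefined, so here is a minimal sketch of it: try the LLM with a timeout, fall back to a cached answer, and finally to a static message. `callLlm` is a hypothetical function you supply, the cache is a plain `Map` standing in for whatever store you use, and the 5-second default is an arbitrary assumption.

```javascript
// Sketch of graceful degradation: LLM -> cache -> static message.
// `callLlm` (async function taking the question) is a hypothetical placeholder.
async function answerWithFallback(question, callLlm, cache, timeoutMs = 5000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("LLM timeout")), timeoutMs);
  });
  try {
    // Whichever settles first wins. Note that losing the race does not cancel
    // the underlying request; it only stops us from waiting for it.
    const reply = await Promise.race([callLlm(question), timeout]);
    cache.set(question, reply); // remember good answers for later outages
    return { source: "llm", reply };
  } catch (err) {
    if (cache.has(question)) {
      return { source: "cache", reply: cache.get(question) };
    }
    return {
      source: "static",
      reply: "Sorry, the assistant is unavailable right now. Please try again.",
    };
  } finally {
    clearTimeout(timer);
  }
}
```

Returning the `source` alongside the reply lets your monitoring count how often users are served degraded answers, which is itself a useful alerting signal.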
Deployment Options
- API-based: Use OpenAI, Anthropic, etc. — easiest
- Self-hosted: Run open models on your infrastructure
- Hybrid: API for complex tasks, self-hosted for simple ones
- Edge: Small models running on user devices
- Graceful Degradation: When the LLM is slow or down, show cached responses, simplified non-LLM answers, or a friendly "try again" message
- Canary Deployment: Route 5-10% of traffic to the new version first. Monitor error rates and latency. If stable, gradually increase to 100%.
- Blue-Green Deployment: Run two identical environments: "blue" (current) and "green" (new). Switch traffic instantly with one DNS/load balancer change. Instant rollback by switching back.
- Self-hosted LLMs: Deploy open-source models (Llama, Mistral) via vLLM, TGI, or Ollama. GPU provisioning, quantization (GPTQ/AWQ), and auto-scaling are key challenges.
Fun Fact: Many companies use "shadow deployment" first — running the new AI alongside the old system without showing results to users. This lets you compare outputs and catch issues before real deployment.
Try It Yourself!
Use the deployment checklist to ensure your AI application is ready for production.
Frequently asked questions
What is the difference between canary and blue-green deployment?
Canary gradually shifts traffic to the new version (e.g. 5% → 25% → 100%), watching error rate and latency at each step, so a bug only hits a fraction of users. Blue-green runs two identical environments and switches all traffic at once with a single DNS or load-balancer change; rollback is instant by switching back. Canary is safer for gradual risk control, blue-green is faster to switch and revert.
Should I deploy an LLM via API or self-hosted?
A managed API (Anthropic, OpenAI) is the fastest path to production: the provider handles GPUs and scaling, you pay per token and depend on their uptime. Self-hosting an open model (Llama, Mistral, Qwen) with vLLM, TGI, or Ollama gives you control over cost and data residency, but you own GPU provisioning, quantization, batching, and auto-scaling. Many teams use a hybrid: API for complex tasks, self-hosted for simple high-volume ones.
What is shadow deployment?
Shadow deployment runs the new model alongside the old one, answering real requests in the background, but its outputs are only logged and never shown to users. This lets you compare the new model's quality against the old one on live traffic and catch problems before they reach anyone. It is usually the first step before a canary rollout.
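A shadow deployment can be sketched as a handler that calls both models but returns only the current one's answer. `callCurrent`, `callCandidate`, and the in-memory `shadowLog` array are hypothetical placeholders; in practice the log would go to your logging or analytics pipeline.

```javascript
// Sketch of a shadow handler: the user only ever sees the current model's reply.
// The candidate's reply (or error) is recorded for offline comparison.
async function handleWithShadow(message, callCurrent, callCandidate, shadowLog) {
  // Start the candidate call first so it runs concurrently with the current one.
  // Both outcomes are converted to plain objects, so a candidate failure can
  // never reject into the user-facing path.
  const candidate = Promise.resolve()
    .then(() => callCandidate(message))
    .then(
      (candidateReply) => ({ candidateReply, error: null }),
      (err) => ({ candidateReply: null, error: String(err) })
    );

  const reply = await callCurrent(message);

  // Log in the background; we do not wait for the candidate before responding.
  candidate.then((result) => shadowLog.push({ message, reply, ...result }));

  return reply;
}
```

The important property is isolation: a slow or crashing candidate adds no latency and no errors for users, though it does roughly double your model spend while the shadow runs.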
How do I safely roll back an LLM deployment?
Every deploy should be reversible in under five minutes. For canary, just stop shifting traffic to the new version. For blue-green, switch the load balancer back to the old environment. For a self-hosted model, revert to the previous checkpoint. The key rule: keep the old version warm and never make a deploy a one-way door.
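For the blue-green case, "switch back" can be as small as flipping one pointer. The sketch below keeps that pointer in memory for illustration; in a real system it would live in your load balancer or config store. The environment names and URLs are hypothetical.

```javascript
// Sketch of a blue-green switch with one-step rollback.
// `environments` maps a name to an upstream URL, e.g. { blue: "...", green: "..." }.
function createBlueGreenSwitch(environments, initial) {
  const has = (name) => Object.prototype.hasOwnProperty.call(environments, name);
  if (!has(initial)) throw new Error(`Unknown environment: ${initial}`);

  let active = initial;
  let previous = null; // stays warm so rollback is just a pointer flip

  return {
    activeName: () => active,
    activeUrl: () => environments[active],
    switchTo(name) {
      if (!has(name)) throw new Error(`Unknown environment: ${name}`);
      if (name === active) return;
      previous = active;
      active = name;
    },
    // Returns false when there is nothing to roll back to.
    rollback() {
      if (previous === null) return false;
      [active, previous] = [previous, active];
      return true;
    },
  };
}

// Usage: deploy to green, then revert if dashboards turn red.
const router = createBlueGreenSwitch(
  { blue: "http://blue.internal", green: "http://green.internal" }, // hypothetical URLs
  "blue"
);
router.switchTo("green");
router.rollback(); // traffic points at blue again
```

This only works as a fast rollback if the old environment is still running and healthy, which is why the text insists on keeping the old version warm.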
Try it yourself
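Safely deploy an LLM app: route calls through a server-side proxy instead of putting the API key in the client
// React component (anti-pattern: the key ships to every browser)
const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer sk-abc123..." // visible to anyone who opens dev tools
  },
  body: JSON.stringify({ model: "gpt-4", messages })
});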
Architecture: React → /api/chat (your server) → OpenAI API
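// server.js (Express proxy)
const express = require("express");
const rateLimit = require("express-rate-limit");
const { OpenAI } = require("openai");

const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated
const openai = new OpenAI(); // reads the API key from process.env.OPENAI_API_KEY
const logger = console; // placeholder; swap in your structured logger

// Assumes an auth middleware (not shown) has already set req.user
const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 10, // 10 req/min/user
  keyGenerator: (req) => req.user.id
});

app.post("/api/chat", limiter, async (req, res) => {
  // 1. Validation
  const { message } = req.body;
  if (typeof message !== "string" || !message || message.length > 4000) {
    return res.status(400).json({ error: "Invalid input" });
  }
  try {
    // 2. LLM call (key stays on the server)
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: message }]
    });
    // 3. Logging
    logger.info({ user: req.user.id, tokens: response.usage });
    res.json({ reply: response.choices[0].message.content });
  } catch (err) {
    // Catch upstream failures so the client gets a response instead of a hung request
    logger.error({ user: req.user.id, error: err.message });
    res.status(502).json({ error: "LLM request failed" });
  }
});

app.listen(process.env.PORT || 3000);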
Never: sk-... in client code, .env in git, CORS without restrictions.
Never put the API key in client code. Production pattern: client → proxy on your server (auth, rate limit, validation, logging) → LLM API. This protects both the key and the budget.