LLM Deployment
FastAPI, Docker, K8s
The Problem: You've built an AI feature. Now how do you deploy it to production safely? How do you handle updates, rollbacks, and scaling?
The Solution: Launch Carefully Like a Rocket
LLM deployment involves getting your AI application into production safely, with proper testing, monitoring, and rollback capabilities. It's like launching a rocket — careful preparation, staged deployment, and constant monitoring. For self-hosted models, quantization reduces resource needs, and observability is essential once you go live.
Like a rocket launch, verify every system before liftoff. The pre-launch checklist:
1. Rate limiting configured: Per-user and per-IP limits prevent abuse and runaway costs
2. API keys secured: Not in client code, stored in environment variables, rotated regularly
3. Error handling for all LLM failure modes: Timeout, rate limit, content filter, malformed response — each has a dedicated handler
4. Monitoring and alerting live: Dashboards track latency, error rates, token usage; alerts fire on anomalies
5. Fallback behavior defined: When the LLM is down: cached responses, simplified non-LLM answers, or a friendly "try again" message
6. Load testing passed: System tested at 2x expected peak traffic — no crashes, acceptable latency
7. Rollback strategy defined: Canary: stop the traffic shift. Blue-green: switch back. Self-hosted: revert to the previous model checkpoint. Every deploy must be reversible in under 5 minutes.
8. A/B evaluation pipeline: Compare old vs. new model outputs on real traffic. Track quality metrics (accuracy, relevance scores) alongside latency and cost before full rollout.
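Checklist item 3 can be sketched as a small error classifier that maps each failure mode to its dedicated handling strategy. This is a minimal, hypothetical sketch: the field names (`status`, `code`, `type`) and return labels are illustrative and not tied to any specific SDK's error shape.

```javascript
// Map each LLM failure mode to a dedicated handling strategy.
// Error fields here are illustrative, not from a specific SDK.
function classifyLLMError(err) {
  if (err.code === "ETIMEDOUT" || err.name === "AbortError") {
    return "timeout"; // retry with backoff, or serve a fallback
  }
  if (err.status === 429) {
    return "rate_limit"; // queue the request or shed load
  }
  if (err.status === 400 && err.type === "content_filter") {
    return "content_filter"; // show a policy message, never retry
  }
  return "malformed_or_unknown"; // log the full payload, alert on spikes
}
```

Each label then routes to its own handler, so a timeout never gets the same treatment as a policy refusal.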
Production checklist: 1 bug in staging is cheaper than 1,000 bugs in production. Test every failure mode before launch.
Deployment Options
- API-based: Use OpenAI, Anthropic, etc. — easiest
- Self-hosted: Run open models on your infrastructure
- Hybrid: API for complex tasks, self-hosted for simple ones
- Edge: Small models running on user devices
- Graceful Degradation: When the LLM is slow or down, show cached responses, simplified non-LLM answers, or a friendly "try again" message
- Canary Deployment: Route 5-10% of traffic to the new version first. Monitor error rates and latency. If stable, gradually increase to 100%.
- Blue-Green Deployment: Run two identical environments: "blue" (current) and "green" (new). Switch traffic instantly with one DNS/load balancer change. Instant rollback by switching back.
- Self-hosted LLMs: Deploy open-source models (Llama, Mistral) via vLLM, TGI, or Ollama. GPU provisioning, quantization (GPTQ/AWQ), and auto-scaling are key challenges.
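The canary pattern above can be sketched as deterministic traffic splitting: hash each user id into a bucket so a given user always sees the same version while roughly `canaryPercent` of the population gets the new one. The hash function and names are illustrative, not a specific load balancer's API.

```javascript
// Deterministically map a user id to a bucket in [0, 99].
// Simple string hash for illustration only.
function hashToBucket(userId) {
  let h = 0;
  for (const ch of String(userId)) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return h % 100;
}

// Route canaryPercent of users to the new version; the rest stay on stable.
function routeVersion(userId, canaryPercent) {
  return hashToBucket(userId) < canaryPercent ? "canary" : "stable";
}
```

Because the bucketing is deterministic, raising `canaryPercent` from 5 to 50 to 100 only moves users one way, and dropping it back to 0 is the instant rollback the checklist requires.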
Fun Fact: Many companies use "shadow deployment" first — running the new AI alongside the old system without showing results to users. This lets you compare outputs and catch issues before real deployment.
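Shadow deployment can be sketched in a few lines: serve the current model's answer, fire the candidate in the background, and record both outputs for offline comparison. `recordComparison` is a hypothetical logging hook, not a real library call.

```javascript
// Shadow deployment: the user only ever sees the current model's answer;
// the candidate runs in parallel purely for comparison.
async function handleRequest(message, currentModel, candidateModel, recordComparison) {
  const servedPromise = currentModel(message);
  // Fire-and-forget: the candidate's result never reaches the user,
  // and a shadow failure must never affect the live path.
  candidateModel(message)
    .then(async (shadow) => recordComparison(message, await servedPromise, shadow))
    .catch(() => {});
  return servedPromise;
}
```

The recorded pairs feed the A/B evaluation pipeline from the checklist before any real traffic shifts.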
Try It Yourself!
Use the deployment checklist to ensure your AI application is ready for production.
Safely deploy an LLM app: use a server-side proxy instead of putting the API key in the client.
// React component. ANTI-PATTERN: the key ships to every browser
const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer sk-abc123..." // visible to anyone who opens dev tools
  },
  body: JSON.stringify({ model: "gpt-4", messages })
});
Architecture: React → /api/chat (your server) → OpenAI API
// server.js (Express proxy)
const express = require("express");
const rateLimit = require("express-rate-limit");
const OpenAI = require("openai");

const app = express();
app.use(express.json());
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute
  max: 10, // 10 req/min/user
  keyGenerator: (req) => req.user.id
});

app.post("/api/chat", limiter, async (req, res) => {
  // 1. Validation
  const { message } = req.body;
  if (!message || message.length > 4000) {
    return res.status(400).json({ error: "Invalid input" });
  }
  try {
    // 2. LLM call (the key never leaves the server)
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [{ role: "user", content: message }]
    });
    // 3. Logging (logger = your structured logger, e.g. pino)
    logger.info({ user: req.user.id, tokens: response.usage });
    res.json({ reply: response.choices[0].message.content });
  } catch (err) {
    res.status(502).json({ error: "LLM temporarily unavailable" });
  }
});
Never: sk-... in client code, .env in git, CORS without restrictions.
Never put the API key in client code. Production pattern: client → proxy on your server (auth, rate limit, validation, logging) → LLM API. This protects both the key and the budget.
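The graceful-degradation behavior described above can be sketched as a wrapper that tries the LLM and falls back to a cached or canned answer on any failure. `callLLM` and the in-memory cache are hypothetical placeholders; a real system would use a shared cache such as Redis.

```javascript
// Remember the last good reply per message so an outage can serve stale
// but useful answers instead of an error page.
const responseCache = new Map();

async function answerWithFallback(message, callLLM) {
  try {
    const reply = await callLLM(message);
    responseCache.set(message, reply); // refresh the cache on success
    return { reply, degraded: false };
  } catch (err) {
    const cached = responseCache.get(message);
    return {
      reply: cached ?? "Sorry, the assistant is busy. Please try again.",
      degraded: true,
    };
  }
}
```

Flagging `degraded: true` lets the UI show a subtle notice and lets monitoring count how often the fallback path fires.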
This lesson is part of a structured LLM course.