API Integration Patterns
Streaming, retries, errors
The Problem: How do you integrate LLMs into your application reliably? What about timeouts, rate limits, retries, and streaming?
The Solution: Standard Connection Methods
API patterns are proven approaches for reliably integrating LLMs into applications. They're like standardized connectors — USB, HDMI, or power plugs — that make it easy to connect different systems together. Streaming reduces perceived latency, and structured output ensures predictable responses.
A reliable integration typically follows five steps:

1. Choose endpoint: Pick the right API (chat completions, embeddings, or function calling) based on your task
2. Set up streaming: Stream tokens for real-time UX; users see the first token in ~200 ms instead of waiting seconds
3. Retry with exponential backoff: Automatically retry on 429 (rate limit), 500 (server error), and 503 (overloaded) with increasing delays
4. Add timeout and circuit breaker: Set request timeouts and stop calling a failing provider after N consecutive errors
5. Handle rate limits gracefully: Queue excess requests, apply backpressure, and show users a meaningful wait indicator
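Step 4 can be sketched as a minimal circuit breaker in pure Python. This is an illustrative implementation, not any SDK's built-in API: `CircuitBreaker`, its thresholds, and the wrapped callable are all placeholder names.

```python
import time


class CircuitBreaker:
    """Stop calling a failing provider after N consecutive errors.

    After `max_failures` consecutive exceptions the circuit "opens" and
    calls fail fast for `reset_after` seconds before one trial call is
    allowed through again (the "half-open" state).
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before retrying the provider
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: provider marked unhealthy")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Any LLM call (or the retry wrapper below) can be routed through `breaker.call(...)` so a dead provider fails fast instead of burning the full retry budget on every request.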
Where Is This Used?
- Chatbots: Streaming for real-time response display
- Batch Processing: Retries and rate limiting for bulk jobs
- High-Availability: Fallback models and circuit breakers
- Cost Control: Request queuing and prioritization
- Multi-Provider Fallback: If primary provider is down, seamlessly switch to backup (e.g., Claude → OpenAI → local model)
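The multi-provider fallback above can be sketched in a few lines. The provider callables here are placeholders standing in for real SDK calls (e.g. Anthropic, OpenAI, a local model); this is not any library's built-in mechanism.

```python
def ask_with_fallback(prompt, providers):
    """Try each provider in order; return the first successful answer.

    `providers` is a list of (name, callable) pairs, ordered by
    preference, e.g. [("claude", ...), ("openai", ...), ("local", ...)].
    """
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as e:
            # Record the failure and fall through to the next provider
            errors.append(f"{name}: {e}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each callable would wrap its own retry and timeout logic, so the fallback only fires after a provider is genuinely unreachable.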
Fun Fact: Streaming can reduce perceived latency by an order of magnitude. Users see the first token in ~200ms instead of waiting 3-5 seconds for the complete response. This single change often has the biggest impact on user satisfaction.
Try It Yourself!
See different API patterns in action and understand how they improve reliability.
Streaming Response
```python
import anthropic

client = anthropic.Anthropic()

# Stream tokens as they're generated
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain vector databases"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

Best Practices
- Streaming: Always use for chat and long responses
- Timeouts: Set reasonable limits (30-60s for LLMs)
- Error handling: Log all errors, show user-friendly messages
- Monitoring: Track latency, error rate, token usage
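The monitoring practice above can be sketched as a small decorator. The in-memory `metrics` dict is a stand-in for a real metrics backend (Prometheus, StatsD, etc.), and `monitored` is an illustrative name, not a library API.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def monitored(fn):
    """Record call count, error count, and total latency for an LLM call."""
    metrics = {"calls": 0, "errors": 0, "total_latency": 0.0}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        metrics["calls"] += 1
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception:
            metrics["errors"] += 1
            logger.exception("LLM call failed")
            raise
        finally:
            # Latency is recorded for successes and failures alike
            metrics["total_latency"] += time.monotonic() - start

    wrapper.metrics = metrics
    return wrapper
```

Wrapping the API call function (`ask_llm = monitored(ask_llm)`) gives you error rate and average latency (`total_latency / calls`) with no changes to call sites.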
Handle errors in LLM API calls with retry logic
```python
import openai

def ask_llm(prompt):
    # Naive version: a 429 rate limit or 5xx server error
    # raises an exception and crashes the caller
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
```python
import openai
import time
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def ask_llm(prompt: str, max_retries: int = 3) -> Optional[str]:
    """Call LLM with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = openai.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except openai.RateLimitError:  # 429: too many requests
            wait = 2 ** attempt  # 1s, 2s, 4s
            logger.warning(f"Rate limit, retry {attempt+1}/{max_retries} in {wait}s")
            time.sleep(wait)
        except openai.APIStatusError as e:  # inspect status for 5xx
            if e.status_code >= 500:
                wait = 2 ** attempt
                logger.error(f"Server error {e.status_code}, retry in {wait}s")
                time.sleep(wait)
            else:
                raise  # other 4xx: client error, don't retry
        except openai.APITimeoutError:
            logger.error(f"Timeout, attempt {attempt+1}/{max_retries}")
    logger.critical(f"All {max_retries} retries failed")
    return None
```
In production, every LLM API call should go through retry logic with exponential backoff, error separation (retry 429s and 5xx, fail fast on other 4xx), and logging. Without this, the app crashes at the first rate limit.