Learn AI Series (#66) - Working with LLM APIs
What will I learn
- You will learn the LLM provider landscape: OpenAI, Anthropic, Google, and open-source alternatives;
- the chat completions API: messages, roles, and streaming;
- function calling and tool use -- letting LLMs take actions;
- structured outputs and JSON mode;
- rate limiting, error handling, and cost optimization;
- building a multi-provider API client from scratch.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs (this post)
Solutions to Episode #65 Exercises
Exercise 1: Hybrid search comparison benchmark -- pure semantic vs pure BM25 vs hybrid.
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Knowledge base: 30 chunks, 3 topics, 10 each
chunks = [
# Python (0-9)
"Python uses indentation for code blocks instead of curly braces.",
"List comprehensions provide concise syntax for creating new lists.",
"The GIL prevents true parallel execution of Python threads.",
"Decorators modify function behavior using the @ syntax.",
"Python's asyncio module enables concurrent IO-bound operations.",
"Type hints in Python 3.10+ support union syntax with the pipe operator.",
"Virtual environments isolate project dependencies from the system Python.",
"f-strings format variables directly inside string literals.",
"The walrus operator := assigns values within expressions.",
"Python's match statement provides structural pattern matching.",
# Machine Learning (10-19)
"Gradient descent updates parameters by following the loss gradient.",
"Overfitting happens when models memorize training data instead of learning patterns.",
"Cross-validation splits data into k folds for robust performance estimates.",
"Random forests combine many decision trees using bagging.",
"The learning rate controls how large each gradient update step is.",
"Batch normalization stabilizes training by normalizing layer inputs.",
"Dropout randomly zeroes activations during training for regularization.",
"Adam optimizer combines momentum with adaptive learning rates.",
"Transfer learning reuses pre-trained model weights on new tasks.",
"Feature scaling ensures all input features have comparable ranges.",
# Cooking (20-29)
"The Maillard reaction browns food when proteins and sugars are heated.",
"Emulsification blends oil and water using agents like egg yolk.",
"Braising sears meat then slow-cooks it in liquid at low temperature.",
"Fermentation converts sugars into acids or alcohol using microorganisms.",
"Caramelization occurs when sugar is heated above 170 degrees Celsius.",
"Blanching briefly boils vegetables then plunges them into ice water.",
"Deglazing adds liquid to a hot pan to dissolve browned food residue.",
"Sous vide cooks food in sealed bags at precise water temperatures.",
"Resting meat after cooking lets juices redistribute throughout.",
"A roux combines equal parts fat and flour as a sauce thickener.",
]
# Ground truth: which topic each chunk belongs to
topics = ["python"] * 10 + ["ml"] * 10 + ["cooking"] * 10
topic_indices = {
"python": set(range(0, 10)),
"ml": set(range(10, 20)),
"cooking": set(range(20, 30)),
}
# Queries: 5 per topic, mix of exact and paraphrased
queries = [
# Python -- exact terminology
("What is the GIL in Python?", "python"),
("How do list comprehensions work?", "python"),
# Python -- paraphrased
("How to run code concurrently in Python?", "python"),
("What is structural pattern matching?", "python"),
("How to format strings with variables?", "python"),
# ML -- exact terminology
("What does batch normalization do?", "ml"),
("How does Adam optimizer work?", "ml"),
# ML -- paraphrased
("How to prevent models from memorizing data?", "ml"),
("How to reuse existing model weights?", "ml"),
("Why scale input features?", "ml"),
# Cooking -- exact terminology
("What is the Maillard reaction?", "cooking"),
("How does sous vide cooking work?", "cooking"),
# Cooking -- paraphrased
("How to thicken a sauce with flour?", "cooking"),
("Why let meat sit after cooking?", "cooking"),
("How to make a stable oil and water mixture?", "cooking"),
]
embeddings = model.encode(chunks, normalize_embeddings=True)
tokenized = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenized)
def search(query, alpha, top_k=3):
    q_emb = model.encode([query], normalize_embeddings=True)
    dense = (embeddings @ q_emb.T).flatten()
    sparse = bm25.get_scores(query.lower().split())
    d_min, d_max = dense.min(), dense.max()
    dense_n = (dense - d_min) / (d_max - d_min + 1e-8)
    s_min, s_max = sparse.min(), sparse.max()
    sparse_n = (sparse - s_min) / (s_max - s_min + 1e-8)
    combined = alpha * dense_n + (1 - alpha) * sparse_n
    return np.argsort(combined)[::-1][:top_k]
print(f"{'Mode':<18} {'P@3':>6} {'R@3':>6}")
print("-" * 32)
for name, alpha in [("Pure semantic", 1.0), ("Pure BM25", 0.0),
                    ("Hybrid (0.5)", 0.5)]:
    all_p, all_r = [], []
    for query_text, topic in queries:
        relevant = topic_indices[topic]
        retrieved = set(search(query_text, alpha, top_k=3))
        p = len(retrieved & relevant) / 3
        r = len(retrieved & relevant) / len(relevant)
        all_p.append(p)
        all_r.append(r)
    print(f"{name:<18} {np.mean(all_p):>6.3f} {np.mean(all_r):>6.3f}")
# Show where each approach wins
print("\nSemantic wins (paraphrased queries):")
for qt, topic in queries:
    sem = set(search(qt, 1.0))
    kw = set(search(qt, 0.0))
    rel = topic_indices[topic]
    if len(sem & rel) > len(kw & rel):
        print(f" '{qt[:50]}' sem={len(sem & rel)} kw={len(kw & rel)}")
print("\nKeyword wins (exact term queries):")
for qt, topic in queries:
    sem = set(search(qt, 1.0))
    kw = set(search(qt, 0.0))
    rel = topic_indices[topic]
    if len(kw & rel) > len(sem & rel):
        print(f" '{qt[:50]}' kw={len(kw & rel)} sem={len(sem & rel)}")
Paraphrased queries ("How to prevent models from memorizing data?") are where semantic search shines -- it connects "memorizing data" to "overfitting". Exact-term queries ("What is the Maillard reaction?") are where BM25 wins -- the word "Maillard" is a direct hit. Hybrid consistently performs well on both types because it combines both signals.
Exercise 2: Re-ranking evaluation harness with NDCG and MRR comparison.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
embedder = SentenceTransformer('all-MiniLM-L6-v2')
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
chunks = [
"Python was created by Guido van Rossum in 1991.",
"The transformer was introduced in Attention Is All You Need 2017.",
"RAG combines retrieval with generation to reduce hallucination.",
"FAISS is a Meta library for efficient vector similarity search.",
"Gradient descent computes partial derivatives of the loss function.",
"BERT uses masked language modeling for pre-training.",
"CNNs use convolutional filters to detect spatial patterns.",
"K-means minimizes within-cluster variance for partitioning.",
"Dropout randomly zeroes activations to prevent overfitting.",
"Word2Vec learns dense word vectors from context prediction.",
"Batch normalization normalizes inputs to stabilize training.",
"Adam combines momentum with adaptive per-parameter learning rates.",
"Cosine similarity measures angle between vectors ignoring magnitude.",
"LSTM gates control information flow solving vanishing gradients.",
"Attention lets models focus on relevant input parts per output.",
"GPT is a decoder-only transformer trained on next token prediction.",
"ReLU activation outputs max(0, x) introducing nonlinearity.",
"Softmax converts logits to probability distributions.",
"L2 regularization adds weight magnitude penalty to the loss.",
"Data augmentation creates synthetic training examples from existing data.",
"Embedding layers map discrete tokens to continuous vectors.",
"Cross-entropy loss measures divergence between predicted and true distributions.",
"Learning rate schedulers reduce the rate during training.",
"Skip connections let gradients flow through residual paths.",
"Positional encoding adds sequence order information to transformers.",
"Tokenization splits text into subword units for model input.",
"Beam search explores multiple generation paths simultaneously.",
"Temperature scaling controls sampling randomness in generation.",
"Knowledge distillation trains small models to mimic larger ones.",
"Mixed precision training uses float16 for speed with float32 accumulation.",
"Gradient clipping caps gradient magnitudes to prevent exploding gradients.",
"LayerNorm normalizes across features within each sample.",
"Multi-head attention runs attention in parallel subspaces.",
"Contrastive learning pushes similar pairs together dissimilar apart.",
"Quantization reduces model precision from 32-bit to 8 or 4-bit.",
"RLHF uses human preference data to align model behavior.",
"Nucleus sampling (top-p) dynamically adjusts the token candidate pool.",
"Instruction tuning trains models to follow natural language instructions.",
"Chain-of-thought prompting improves reasoning by showing intermediate steps.",
"Few-shot prompting provides examples in the prompt for task adaptation.",
"Retrieval models encode queries and documents into comparable embeddings.",
"BM25 ranks documents by term frequency and inverse document frequency.",
"Cross-encoders jointly process query-document pairs for accurate relevance.",
"Bi-encoders independently encode queries and documents for fast retrieval.",
"Re-ranking improves initial retrieval by scoring pairs with cross-encoders.",
"Vector databases index embeddings for fast approximate nearest neighbor search.",
"HNSW builds navigable small-world graphs for ANN search.",
"Product quantization compresses vectors by quantizing sub-vectors.",
"Semantic search finds conceptually related content beyond keyword overlap.",
"Hybrid search combines dense retrieval with sparse keyword matching.",
]
embeddings = embedder.encode(chunks, normalize_embeddings=True)
# 10 queries with ground truth
test_queries = [
("Who created Python?", {0}),
("What paper introduced transformers?", {1}),
("How does RAG work?", {2}),
("What library does Meta offer for vector search?", {3}),
("How does gradient descent update weights?", {4}),
("What is BERT's training objective?", {5}),
("How do CNNs detect features?", {6}),
("How does k-means partition data?", {7}),
("How does dropout regularize?", {8}),
("What does attention do in transformers?", {14}),
]
def dcg_at_k(retrieved, relevant, k):
    score = 0.0
    for i, idx in enumerate(retrieved[:k]):
        if idx in relevant:
            score += 1.0 / np.log2(i + 2)
    return score
def ndcg_at_k(retrieved, relevant, k):
    dcg = dcg_at_k(retrieved, relevant, k)
    ideal = sum(1.0 / np.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
def mrr(retrieved, relevant):
    for rank, idx in enumerate(retrieved, 1):
        if idx in relevant:
            return 1.0 / rank
    return 0.0
print(f"{'Query':<40} {'Bi NDCG':>8} {'Re NDCG':>8} {'Move':>6}")
print("-" * 65)
bi_ndcgs, re_ndcgs = [], []
bi_mrrs, re_mrrs = [], []
for query, relevant in test_queries:
    q_emb = embedder.encode([query], normalize_embeddings=True)
    scores = (embeddings @ q_emb.T).flatten()
    bi_top20 = np.argsort(scores)[::-1][:20]
    pairs = [(query, chunks[i]) for i in bi_top20]
    re_scores = reranker.predict(pairs)
    re_ranked = [bi_top20[i] for i in np.argsort(re_scores)[::-1]]
    bi_n = ndcg_at_k(bi_top20, relevant, 5)
    re_n = ndcg_at_k(re_ranked, relevant, 5)
    bi_ndcgs.append(bi_n)
    re_ndcgs.append(re_n)
    bi_mrrs.append(mrr(bi_top20, relevant))
    re_mrrs.append(mrr(re_ranked, relevant))
    rel_idx = list(relevant)[0]
    bi_pos = list(bi_top20).index(rel_idx) + 1 if rel_idx in bi_top20 else -1
    re_pos = list(re_ranked).index(rel_idx) + 1 if rel_idx in re_ranked else -1
    move = bi_pos - re_pos if bi_pos > 0 and re_pos > 0 else 0
    arrow = f"+{move}" if move > 0 else str(move)
    print(f"{query[:38]:<40} {bi_n:>8.3f} {re_n:>8.3f} {arrow:>6}")
print(f"\n{'Mean NDCG@5:':<40} {np.mean(bi_ndcgs):>8.3f} {np.mean(re_ndcgs):>8.3f}")
print(f"{'Mean MRR:':<40} {np.mean(bi_mrrs):>8.3f} {np.mean(re_mrrs):>8.3f}")
print(f"{'NDCG improvement:':<40} {np.mean(re_ndcgs) - np.mean(bi_ndcgs):>+8.3f}")
The "Move" column shows how many positions the correct chunk jumped after re-ranking. Positive numbers mean the re-ranker moved it closer to the top -- which is exactly what you want. Queries with distinctive terms ("FAISS", "BERT") might not need re-ranking because the bi-encoder already puts them at position 1. The re-ranker earns its keep on ambiguous queries where multiple chunks contain similar vocabulary.
Exercise 3: RAG failure diagnostic tool.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
chunks = [
"Python was created by Guido van Rossum, first released in 1991.",
"The transformer architecture was introduced in 2017 by Vaswani et al.",
"RAG retrieves relevant documents then uses them as context for generation.",
"FAISS is Meta's library for efficient similarity search over vectors.",
"Gradient descent updates parameters using partial derivatives of the loss.",
"BERT uses masked language modeling where random tokens are hidden.",
"CNNs apply convolutional filters to detect spatial patterns in images.",
"K-means partitions data into k clusters minimizing within-cluster variance.",
"Dropout randomly sets neuron outputs to zero during training.",
"The attention mechanism lets models weigh input parts by relevance.",
]
embeddings = model.encode(chunks, normalize_embeddings=True)
test_set = [
("Who invented Python?", "Guido van Rossum created Python in 1991.", {0}),
("When were transformers introduced?", "The transformer was introduced in 2017.", {1}),
("What is RAG?", "RAG retrieves documents and uses them for generation.", {2}),
("What does FAISS do?", "FAISS does efficient similarity search.", {3}),
("How does gradient descent work?", "It updates parameters via derivatives.", {4}),
("What training method does BERT use?", "Masked language modeling.", {5}),
("How do CNNs find patterns?", "Through convolutional filters.", {6}),
("What does k-means minimize?", "Within-cluster variance.", {7}),
("How does dropout help?", "By randomly zeroing outputs during training.", {8}),
("What is the attention mechanism?", "It weighs input parts by relevance.", {9}),
]
def retrieve(query, k=3):
    q_emb = model.encode([query], normalize_embeddings=True)
    scores = (embeddings @ q_emb.T).flatten()
    return list(np.argsort(scores)[::-1][:k])
retrieval_fails = 0
generation_fails = 0
successes = 0
fixable_by_expansion = 0
fixable_by_rerank = 0
print(f"{'Query':<35} {'Status':<12} {'Fix'}")
print("-" * 65)
for question, answer, relevant in test_set:
    top3 = retrieve(question, k=3)
    hit = bool(set(top3) & relevant)
    if not hit:
        status = "RETR FAIL"
        # Test query expansion fix
        alt_queries = [question,
                       question.replace("What", "Explain"),
                       question + " definition"]
        expanded_results = set()
        for q in alt_queries:
            expanded_results.update(retrieve(q, k=3))
        fixed = bool(expanded_results & relevant)
        fix = "expansion: YES" if fixed else "expansion: NO"
        retrieval_fails += 1
        if fixed:
            fixable_by_expansion += 1
    else:
        # Check if correct chunk is at position 1
        rel_idx = list(relevant)[0]
        pos = top3.index(rel_idx) + 1 if rel_idx in top3 else -1
        if pos == 1:
            status = "SUCCESS"
            fix = "-"
            successes += 1
        else:
            status = "GEN RISK"
            fix = f"rerank: pos {pos}->1"
            fixable_by_rerank += 1
    print(f"{question[:33]:<35} {status:<12} {fix}")
print(f"\n--- Diagnostic Summary ---")
print(f"Successes: {successes}/10")
print(f"Retrieval failures: {retrieval_fails}/10")
print(f"Generation risks: {fixable_by_rerank}/10")
print(f"Fixed by expansion: {fixable_by_expansion}/{retrieval_fails}")
print(f"Fixed by re-rank: {fixable_by_rerank}/{fixable_by_rerank}")
The diagnostic separates two different failure modes. A retrieval failure means the correct chunk never made it into the top 3 -- query expansion might fix it by trying alternative phrasings. A generation risk means the correct chunk was retrieved but buried at position 2 or 3 -- re-ranking (from episode #65) would push it to position 1 where the LLM is most likely to use it.
On to today's episode
Here we go! We've spent 65 episodes understanding how AI works from the inside -- from the very first linear regression in episode #10 all the way through building transformers from scratch in episode #56, training LLMs in episode #60, and wiring up full RAG pipelines in episodes #64 and #65. We know how these models work. Now it's time to use them.
In practice, most AI applications don't train models from scratch. They call APIs. You send a prompt, you get a response, you pay per token. The engineering challenge shifts from "how does the model work internally?" (which we've covered extensively) to "how do I use it effectively, reliably, and affordably?" And that's a genuinely different skill set -- one that involves understanding HTTP APIs, token economics, error handling, and provider-specific quirks rather than gradient descent and backpropagation ;-)
The provider landscape
The major LLM API providers as of 2026:
OpenAI (GPT-4, GPT-4o, o1): the first-mover and still the most widely used API. Strong at reasoning, coding, and general tasks. The most extensive third-party ecosystem of tools, wrappers, and documentation. Their API format has become the de facto standard that other providers copy.
Anthropic (Claude 3.5 Sonnet, Claude 3 Opus): excels at long-context tasks (200K tokens), careful instruction following, and nuanced reasoning. Strong safety focus. The messages API differs from OpenAI's in how system prompts are handled (separate system parameter vs system message in the messages array).
Google (Gemini 1.5 Pro, Gemini Ultra): competitive with the top models, tightly integrated with Google's ecosystem. Gemini 1.5 Pro handles up to 1M token contexts -- the largest available. If you're already in the Google Cloud ecosystem, the integration is straightforward.
Open-source (LLaMA 3, Mistral, Mixtral, Qwen): run locally or on your own infrastructure. No API costs, full control over the model, no data leaving your servers. The tradeoff: you handle hosting, inference optimization, and scaling yourself. We'll cover local deployment in a future episode.
All major providers have converged on a similar API format. Learn one, and switching between them is mostly changing a URL, a parameter name, and maybe how the system prompt is passed. Having said that, each provider has its own quirks and the devil is in the details.
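To make the system-prompt difference concrete, here is a minimal sketch that takes an OpenAI-style message list and reshapes it into a payload with the system prompt as a separate top-level field, which is how the Anthropic messages API expects it. The helper names `split_system_prompt` and `to_anthropic_style` and the model name `example-model` are made up for illustration. The code only builds a dictionary and makes no network call.

```python
def split_system_prompt(messages):
    """Separate system messages from the rest of an OpenAI-style message list.

    Returns (system_text, remaining_messages). Joining several system
    messages with blank lines is a choice made for this sketch.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    remaining = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), remaining


def to_anthropic_style(messages, model, max_tokens=500):
    """Build a request payload with the system prompt as a top-level field."""
    system_text, remaining = split_system_prompt(messages)
    payload = {"model": model, "max_tokens": max_tokens, "messages": remaining}
    if system_text:
        payload["system"] = system_text
    return payload


openai_style = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a prime checker."},
]
payload = to_anthropic_style(openai_style, model="example-model")  # placeholder model name
print(payload["system"])
print(payload["messages"])
```

Keeping this conversion in one small function means the rest of your application can work with a single message format and only translate at the provider boundary.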
The Chat Completions API
The standard interface that every provider has adopted (with minor variations):
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to check if a number is prime."},
],
temperature=0.2,
max_tokens=500,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, "
f"{response.usage.completion_tokens} out")
Three message roles define the conversation:
- system: sets the model's behavior, persona, and constraints. Processed first, shapes every response. Think of it as the model's "job description"
- user: the human's input -- questions, instructions, data
- assistant: the model's previous responses (you include these for multi-turn conversations so the model has context of what it already said)
The temperature parameter (0.0 to 2.0) controls randomness. We covered this in detail in episode #62 when we discussed sampling -- lower temperature makes the model more deterministic and focused, higher temperature makes it more creative and varied. For code generation and factual Q&A, 0.0-0.3 is typical. For creative writing, 0.7-1.0.
max_tokens caps the response length. This is important for cost control (you pay per output token) and for preventing the model from rambling. If you need a concise answer, set it to 200-300. If you need a detailed explanation, 1000-2000.
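Since billing is per token, a rough cost estimate is just arithmetic on the `usage` numbers. The sketch below assumes a hypothetical price table: the model names and the prices in `PRICES_PER_MILLION` are placeholders, not real provider pricing, so look up current rates before relying on the numbers.

```python
# PLACEHOLDER prices in USD per 1M tokens -- not real provider pricing.
PRICES_PER_MILLION = {
    "example-large": {"input": 5.00, "output": 15.00},
    "example-small": {"input": 0.50, "output": 1.50},
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Return the estimated cost in USD for one request."""
    price = PRICES_PER_MILLION[model]
    total = prompt_tokens * price["input"] + completion_tokens * price["output"]
    return total / 1_000_000


# Token counts as they would appear in response.usage
cost = estimate_cost("example-large", prompt_tokens=1200, completion_tokens=500)
print(f"Estimated cost: ${cost:.4f}")
```

Passing your `max_tokens` value as `completion_tokens` gives an upper bound on the output side of the bill for a single call.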
Multi-turn conversations
For a multi-turn conversation, you maintain the full message history and send it with every request. The model itself is stateless -- it doesn't remember previous calls. Your code is responsible for maintaining context:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "system", "content": "You are a Python tutor."}]
def chat(user_message):
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    assistant_msg = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg
# Each call includes the full conversation history
print(chat("What are list comprehensions?"))
print(chat("Show me a nested example.")) # model sees prior context
print(chat("Now rewrite that using a regular for loop.")) # still has context
# The messages list grows with each turn:
print(f"\nConversation has {len(messages)} messages "
f"({sum(len(m['content']) for m in messages)} chars)")
This is why context windows matter. Every turn adds to the message history. A conversation that goes back and forth 20 times might accumulate 10,000+ tokens of history, eating into your context budget. Production chat systems typically implement sliding window strategies -- keeping the system prompt and the last N messages, summarizing or dropping older messages to stay within limits. We'll see exactly how to build this kind of context management when we cover building agents in upcoming episodes.
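A sliding window of the simplest kind can be sketched in a few lines. The helper below is hypothetical (`trim_history` is not part of any SDK) and counts messages rather than tokens, which is a simplification; a production version would budget by tokens and would also need to keep tool-call messages paired with their results.

```python
def trim_history(messages, max_messages=6):
    """Keep all system messages plus the most recent `max_messages` others."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    recent = rest[-max_messages:] if max_messages > 0 else []
    # Don't start the window with an assistant reply whose question was dropped
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


# Simulated conversation: one system prompt and five question/answer turns
history = [{"role": "system", "content": "You are a Python tutor."}]
for turn in range(1, 6):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

trimmed = trim_history(history, max_messages=4)
print(len(history), "->", len(trimmed))
print([m["content"] for m in trimmed])
```

You would call something like this right before each API request, sending the trimmed list while keeping the full history in your own storage.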
Streaming responses
For interactive applications, waiting for the full response to complete before showing anything creates a terrible user experience. The model might take 5-10 seconds to generate a long answer, and the user is staring at a blank screen. Streaming delivers the response token by token as it's generated:
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
stream = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user",
"content": "Explain gradient descent in 3 sentences."},
],
stream=True,
)
full_response = ""
token_count = 0
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        full_response += delta
        token_count += 1
print() # newline after streaming completes
print(f"\nReceived ~{token_count} chunks")
Streaming doesn't change the output -- the model generates the same tokens in the same order. It just delivers them incrementally as they're produced. Time-to-first-token is typically 200-500ms; total generation time depends on output length. The user sees words appearing in real time, which feels MUCH more responsive even though the total wait is identical.
One subtlety: when streaming, the usage field (token counts) is typically not available until the stream completes. Some providers include it in the final chunk, others don't include it at all in streaming mode. If you need exact token counts for billing, you may need to use the non-streaming API or count tokens client-side using a tokenizer (which is its own topic -- we'll cover tokenization in a dedicated episode).
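As a sketch of how a stream consumer can cope with both cases, the function below collects text and picks up a usage object if one of the chunks carries it. To the best of my knowledge the OpenAI API sends such a final chunk (with an empty `choices` list) when you pass `stream_options={"include_usage": True}`, but check the current documentation for your provider. The chunks here are simulated with `SimpleNamespace` so the example runs without an API key; `consume_stream` and `fake_chunk` are names invented for this illustration.

```python
from types import SimpleNamespace


def consume_stream(stream):
    """Collect text from chunk-like objects; return (text, usage or None)."""
    parts = []
    usage = None
    for chunk in stream:
        # A chunk may carry usage information and no choices at all
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts), usage


def fake_chunk(text=None, usage=None):
    """Stand-in with the same attribute shape as a streamed chunk."""
    choices = []
    if text is not None:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


simulated = [
    fake_chunk("Gradient "),
    fake_chunk("descent."),
    fake_chunk(usage=SimpleNamespace(prompt_tokens=12, completion_tokens=3)),
]
text, usage = consume_stream(simulated)
print(text)
print(usage.completion_tokens if usage else "no usage reported")
```

If `usage` comes back as `None`, that is your signal to fall back to client-side counting.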
Function calling and tool use
This is where things get interesting. Function calling (also called tool use) lets the model decide when to call external functions and what arguments to pass. You define available functions with JSON schemas; the model generates structured calls when it determines they're needed:
import os
import json
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Define available tools as JSON schemas
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. Amsterdam",
},
"units": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"default": "celsius",
},
},
"required": ["location"],
},
},
}]
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "What's the weather in Amsterdam?"},
],
tools=tools,
)
# The model doesn't CALL the function -- it generates a structured
# request for YOUR code to execute
msg = response.choices[0].message
if msg.tool_calls:
    tool_call = msg.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
# Output: Function: get_weather
# Arguments: {"location": "Amsterdam", "units": "celsius"}
This is a critical distinction -- the model does NOT execute functions. It generates structured call requests that your code inspects, validates, and executes (or rejects). The model is sandboxed: it can ask for actions but can't take them directly. Your code decides whether to honor each request.
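What "inspects, validates, and executes (or rejects)" can look like in code: a small dispatcher that only runs functions from an allowlist and checks the arguments first. This is a minimal sketch, not a complete validation layer. `TOOL_REGISTRY` and `dispatch_tool_call` are hypothetical names, and `get_weather` is a stand-in that returns fixed data.

```python
import json


def get_weather(location, units="celsius"):
    """Stand-in implementation used only for this sketch."""
    return json.dumps({"location": location, "temp": 14, "units": units})


# Allowlist: tool name -> (callable, required argument names, allowed argument names)
TOOL_REGISTRY = {
    "get_weather": (get_weather, {"location"}, {"location", "units"}),
}


def dispatch_tool_call(name, raw_arguments):
    """Validate a model-generated call, then execute it or return an error."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    func, required, allowed = TOOL_REGISTRY[name]
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    missing = required - set(args)
    unexpected = set(args) - allowed
    if missing or unexpected:
        return json.dumps({"error": "bad arguments",
                           "missing": sorted(missing),
                           "unexpected": sorted(unexpected)})
    return func(**args)


print(dispatch_tool_call("get_weather", '{"location": "Amsterdam"}'))
print(dispatch_tool_call("delete_files", '{"path": "/"}'))
```

Returning the error as a JSON string (rather than raising) means you can send it back to the model as the tool result and let it correct its own call.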
The full loop looks like this:
import os
import json
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"},
"units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}]
def get_weather(location, units="celsius"):
    """Simulated weather API -- in production, call a real API."""
    data = {
        "Amsterdam": {"temp": 14, "condition": "Cloudy"},
        "Tokyo": {"temp": 22, "condition": "Sunny"},
    }
    info = data.get(location, {"temp": 20, "condition": "Unknown"})
    return json.dumps({"location": location, **info, "units": units})
# Step 1: send message + tools
messages = [{"role": "user", "content": "What's the weather in Amsterdam?"}]
response = client.chat.completions.create(
model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message
# Step 2: execute the tool call
if msg.tool_calls:
    tc = msg.tool_calls[0]
    args = json.loads(tc.function.arguments)
    result = get_weather(**args)
    print(f"Tool called: {tc.function.name}({args})")
    print(f"Tool result: {result}")
    # Step 3: send tool result back to the model
    messages.append(msg)  # include the assistant's tool_call message
    messages.append({
        "role": "tool",
        "tool_call_id": tc.id,
        "content": result,
    })
    final = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools)
    print(f"\nFinal answer: {final.choices[0].message.content}")
The three-step dance: (1) model sees the question and available tools, decides to call get_weather, (2) your code executes the actual function and gets a result, (3) you send the result back and the model generates a natural language answer incorporating the data. This pattern is the foundation for building AI agents -- systems that can take actions in the real world through tool calls. We'll go much deeper into agent architectures in upcoming episodes.
Structured outputs and JSON mode
Sometimes you don't want free-form text -- you want the model to return structured data you can parse programmatically. JSON mode forces the model's output to be valid JSON:
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Extract entities from text. Return JSON with keys: "
                    "persons (list), organizations (list), locations (list)."},
        {"role": "user",
         "content": "Guido van Rossum created Python at CWI in Amsterdam."},
    ],
    response_format={"type": "json_object"},
    temperature=0.0,
)
data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
# {
#   "persons": ["Guido van Rossum"],
#   "organizations": ["CWI"],
#   "locations": ["Amsterdam"]
# }

# Now you can use the structured data in code
for person in data.get("persons", []):
    print(f"Person found: {person}")
This is enormously practical. Instead of parsing natural language output with fragile regex patterns ("hope the model says it in the format I expect"), you get output that parses as valid JSON (as long as the response isn't cut off by the token limit). The model might still get the content wrong (hallucinate entities, miss some), but the structure is reliable. For production pipelines where LLM output feeds into downstream code, structured outputs are essentially mandatory.
OpenAI also supports strict structured outputs with a JSON schema parameter that enforces exact key names and types. Anthropic has a similar feature via tool use with a defined schema. The specifics differ, but the pattern is the same: tell the model exactly what structure you need, and it conforms.
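To make the strict variant a bit more concrete, here is a rough sketch for the entity extraction task. The `response_format` shape below reflects my understanding of OpenAI's `json_schema` option at the time of writing, so treat it as an assumption and check the current API reference. The `check_entities` helper is a hypothetical name of my own: a small local sanity check you could run on the reply regardless of provider.

```python
import json

# Schema for the entity extraction task. Strict mode is assumed to require
# every property listed in "required" and additionalProperties set to False.
ENTITY_SCHEMA = {
    "type": "object",
    "properties": {
        "persons": {"type": "array", "items": {"type": "string"}},
        "organizations": {"type": "array", "items": {"type": "string"}},
        "locations": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["persons", "organizations", "locations"],
    "additionalProperties": False,
}

# Assumed parameter shape; it would be passed as
# client.chat.completions.create(..., response_format=response_format)
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "entities", "strict": True, "schema": ENTITY_SCHEMA},
}


def check_entities(raw_json):
    """Parse a reply and verify it has exactly the expected list-of-string keys."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in ENTITY_SCHEMA["required"]:
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
    extra = set(data) - set(ENTITY_SCHEMA["required"])
    if extra:
        raise ValueError(f"unexpected keys: {sorted(extra)}")
    return data


reply = ('{"persons": ["Guido van Rossum"], '
         '"organizations": ["CWI"], "locations": ["Amsterdam"]}')
print(check_entities(reply))
```

Even with schema enforcement on the provider side, a cheap local check like this is a reasonable belt-and-braces step before the data flows into downstream code.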
Error handling and retries
API calls fail. Networks time out. Rate limits trigger. Servers return 500s. A robust client handles all of these gracefully with exponential backoff:
import os
import time
from openai import OpenAI, RateLimitError, APITimeoutError, APIConnectionError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def call_llm(messages, max_retries=3, **kwargs):
    """Call the LLM with automatic retry and exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=kwargs.get("model", "gpt-4o"),
                messages=messages,
                temperature=kwargs.get("temperature", 0.3),
                max_tokens=kwargs.get("max_tokens", 1000),
            )
            return response.choices[0].message.content
        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait}s... "
                  f"(attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except APITimeoutError:
            if attempt == max_retries - 1:
                raise
            print(f"Timeout. Retrying ({attempt + 1}/{max_retries})...")
            time.sleep(1)
        except APIConnectionError as e:
            if attempt == max_retries - 1:
                raise
            print(f"Connection error: {e}. Retrying...")
            time.sleep(1)
    raise Exception(f"Failed after {max_retries} retries")

# Usage
# result = call_llm([{"role": "user", "content": "Hello!"}])
# print(result)
Exponential backoff is the key pattern: wait 1 second after the first failure, 2 seconds after the second, 4 seconds after the third. This avoids hammering the endpoint (which would only keep you rate-limited for longer) while giving the server time to recover.
Rate limits vary by provider and pricing tier. OpenAI limits are per-minute (both tokens-per-minute and requests-per-minute). When you hit them, the API returns HTTP 429, typically with response headers (such as Retry-After) that indicate how long to wait. In production, you'd also want to add jitter (a small random delay on top of the backoff) to prevent multiple clients from retrying in sync and creating a thundering herd.
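As a small sketch of what backoff with jitter could look like: the `backoff_delay` helper and its parameters (`base`, `cap`, `max_jitter`, `retry_after`) are my own hypothetical names, and the code assumes a server-supplied wait arrives as a number of seconds (a Retry-After header can also be an HTTP date, which this sketch does not handle).

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, max_jitter=0.5,
                  retry_after=None, rng=random):
    """Return the seconds to sleep before retrying.

    attempt is 0-based: 0 -> ~1s, 1 -> ~2s, 2 -> ~4s, capped at `cap`.
    If the server supplied a wait time in seconds, use that instead.
    """
    if retry_after is not None:
        delay = float(retry_after)  # assumes a seconds value, not an HTTP date
    else:
        delay = min(cap, base * 2 ** attempt)
    # Jitter: a small random extra so clients don't all retry at the same instant.
    return delay + rng.uniform(0, max_jitter)


for attempt in range(4):
    print(f"attempt {attempt}: sleep {backoff_delay(attempt):.2f}s")
```

In the retry loop above, you would call `time.sleep(backoff_delay(attempt))` in place of the fixed `2 ** attempt` wait. The cap matters more than it looks: without it, a long outage has your client sleeping for minutes between attempts.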
Cost optimization
LLM APIs charge per token, both input and output. At GPT-4o pricing, you're looking at roughly $2.50 per million input tokens and $10 per million output tokens. That might sound cheap until you're processing thousands of requests per day.
def estimate_costs(scenarios):
    """Estimate API costs for different usage patterns."""
    # Approximate pricing per 1M tokens (varies by provider/model)
    pricing = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
        "claude-3-haiku": {"input": 0.25, "output": 1.25},
    }
    print(f"{'Scenario':<35} {'Model':<20} {'Daily':>10} {'Monthly':>10}")
    print("-" * 78)
    for name, queries_day, avg_in, avg_out in scenarios:
        for model, prices in pricing.items():
            daily_in = queries_day * avg_in / 1_000_000 * prices["input"]
            daily_out = queries_day * avg_out / 1_000_000 * prices["output"]
            daily = daily_in + daily_out
            monthly = daily * 30
            print(f"{name:<35} {model:<20} ${daily:>8.2f} ${monthly:>8.2f}")
        print()

# Each scenario: (name, queries per day, avg input tokens, avg output tokens)
scenarios = [
    ("Small chatbot (100 q/day)", 100, 500, 200),
    ("RAG system (1000 q/day)", 1000, 2000, 500),
    ("Batch processing (10K items/day)", 10000, 1000, 300),
]
estimate_costs(scenarios)
print("Key insight: the right model for the job makes a HUGE difference.")
print("GPT-4o-mini at $0.15/M input is roughly 17x cheaper than GPT-4o.")
print("Use the cheapest model that meets your quality requirements.")
Cost reduction strategies that actually work in production:
- Use the right model: GPT-4o-mini or Claude Haiku for classification, extraction, and simple tasks. Full models only for complex reasoning. This single choice often reduces costs by 10-20x.
- Cache responses: identical or near-identical queries can return cached results. A simple dictionary cache with query hashing catches repeated questions.
- Limit output tokens: set max_tokens to prevent verbose responses. If you need a yes/no answer, don't let the model write 500 words.
- Batch processing: OpenAI's Batch API offers 50% cost reduction for non-time-sensitive workloads. You submit a file of requests, they process it within 24 hours.
- Prompt optimization: shorter, more focused prompts reduce input token count. Every word in your system prompt is multiplied by every request.
- Local models: for high-volume, low-complexity tasks, running an open-source model locally eliminates API costs entirely. The upfront cost is hardware; the marginal cost per query approaches zero.
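The caching strategy from the list above can be sketched in a few lines. This is a minimal in-memory version that only catches exact repeats (near-identical queries would need some normalization first, which I leave out). `ResponseCache` and `get_or_call` are hypothetical names, and `call_fn` stands in for whatever function actually hits the API:

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed on a hash of (model, messages)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, messages):
        # sort_keys makes the hash independent of dict key order.
        payload = json.dumps({"model": model, "messages": messages},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call_fn):
        """Return the cached result, or call call_fn(model, messages) once."""
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_fn(model, messages)
        self._store[key] = result
        return result


def fake_llm(model, messages):
    """Placeholder for a real API call, so the sketch runs offline."""
    return f"[{model}] answer to: {messages[-1]['content']}"


cache = ResponseCache()
question = [{"role": "user", "content": "What is RAG?"}]
cache.get_or_call("gpt-4o-mini", question, fake_llm)
cache.get_or_call("gpt-4o-mini", question, fake_llm)  # served from cache
print(f"hits={cache.hits}, misses={cache.misses}")
```

Caching only makes sense when a repeated answer is acceptable, so it pairs naturally with low temperature settings. For anything beyond a single process you would swap the dictionary for a shared store and add an expiry policy.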
Building a multi-provider client
A practical pattern I use quite often: abstract the provider behind a common interface so you can switch between them, A/B test, and implement fallbacks:
import os

class LLMClient:
    """Provider-agnostic LLM client with automatic fallback."""

    def __init__(self, provider="openai", model=None):
        self.provider = provider
        if provider == "openai":
            from openai import OpenAI
            self.client = OpenAI(
                api_key=os.environ.get("OPENAI_API_KEY"))
            self.model = model or "gpt-4o"
        elif provider == "anthropic":
            import anthropic
            self.client = anthropic.Anthropic(
                api_key=os.environ.get("ANTHROPIC_API_KEY"))
            self.model = model or "claude-3-5-sonnet-20241022"
        else:
            raise ValueError(f"Unknown provider: {provider}")

    def chat(self, messages, **kwargs):
        """Send a chat completion request."""
        if self.provider == "openai":
            return self._openai_chat(messages, **kwargs)
        elif self.provider == "anthropic":
            return self._anthropic_chat(messages, **kwargs)

    def _openai_chat(self, messages, **kwargs):
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=kwargs.get("temperature", 0.3),
            max_tokens=kwargs.get("max_tokens", 1000),
        )
        return {
            "content": resp.choices[0].message.content,
            "tokens_in": resp.usage.prompt_tokens,
            "tokens_out": resp.usage.completion_tokens,
            "model": resp.model,
        }

    def _anthropic_chat(self, messages, **kwargs):
        # Anthropic handles system prompts differently
        system = ""
        user_msgs = []
        for m in messages:
            if m["role"] == "system":
                system = m["content"]
            else:
                user_msgs.append(m)
        resp = self.client.messages.create(
            model=self.model,
            system=system,
            messages=user_msgs,
            max_tokens=kwargs.get("max_tokens", 1000),
            temperature=kwargs.get("temperature", 0.3),
        )
        return {
            "content": resp.content[0].text,
            "tokens_in": resp.usage.input_tokens,
            "tokens_out": resp.usage.output_tokens,
            "model": resp.model,
        }

def chat_with_fallback(messages, primary="openai", fallback="anthropic",
                       **kwargs):
    """Try primary provider, fall back to secondary on failure."""
    try:
        llm = LLMClient(provider=primary)
        result = llm.chat(messages, **kwargs)
        result["provider"] = primary
        return result
    except Exception as e:
        print(f"{primary} failed: {e}. Falling back to {fallback}...")
        llm = LLMClient(provider=fallback)
        result = llm.chat(messages, **kwargs)
        result["provider"] = fallback
        return result

# Usage -- swap providers with one parameter change
# llm = LLMClient(provider="openai")
# llm = LLMClient(provider="anthropic")
# result = llm.chat([
#     {"role": "system", "content": "Be concise."},
#     {"role": "user", "content": "What is RAG?"},
# ])
# print(f"[{result['model']}] {result['content']}")
# print(f"Tokens: {result['tokens_in']} in, {result['tokens_out']} out")

print("Multi-provider client enables:")
print("  - A/B testing between providers")
print("  - Automatic fallback on provider outages")
print("  - Route tasks to different models by complexity")
print("  - Switch providers without changing application code")
This abstraction is genuinely useful in production. You route simple classification tasks to the cheapest model, complex reasoning to the most capable, and if one provider goes down (it happens more often than you'd think), your system automatically falls back to the other. The application code never needs to know which provider served a particular request.
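Routing by task complexity can start out as nothing more than a lookup table. The sketch below is one possible shape: the task labels, the `ROUTES` table, and the `pick_route` helper are all hypothetical, and which model belongs in which tier is something you would decide from your own quality tests.

```python
# Hypothetical routing table: task type -> (provider, model).
ROUTES = {
    "classification": ("openai", "gpt-4o-mini"),
    "extraction": ("openai", "gpt-4o-mini"),
    "reasoning": ("openai", "gpt-4o"),
}
# Unknown task types go to the more capable (and more expensive) model.
DEFAULT_ROUTE = ("openai", "gpt-4o")


def pick_route(task_type):
    """Return the (provider, model) pair to use for a given task type."""
    return ROUTES.get(task_type, DEFAULT_ROUTE)


for task in ["classification", "reasoning", "translation"]:
    provider, model = pick_route(task)
    print(f"{task:<15} -> {provider}/{model}")
```

With the LLMClient class above, the result would plug in as `LLMClient(provider=provider, model=model)`. Defaulting unknown tasks to the stronger model trades cost for safety; the opposite default is just as defensible if budget is the tighter constraint.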
Token counting and context management
When building production systems, you need to know how many tokens your input will use before sending it. This lets you stay within context limits and estimate costs:
import tiktoken

def count_tokens(text, model="gpt-4o"):
    """Count tokens for a given text and model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

def count_message_tokens(messages, model="gpt-4o"):
    """Count total tokens for a list of chat messages.

    Each message has overhead tokens for role, separators, etc.
    """
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    tokens_per_message = 3  # role + separators
    total = 0
    for msg in messages:
        total += tokens_per_message
        for key, value in msg.items():
            total += len(encoding.encode(str(value)))
    total += 3  # reply priming tokens
    return total

# Examples
texts = [
    "Hello, world!",
    "Explain the transformer architecture in detail.",
    "The quick brown fox " * 50,  # ~200 words
]
print(f"{'Text':<50} {'Tokens':>7} {'Est cost (in)':>14}")
print("-" * 75)
for text in texts:
    tokens = count_tokens(text)
    cost = tokens / 1_000_000 * 2.50  # GPT-4o input pricing
    print(f"{text[:48]:<50} {tokens:>7} ${cost:>12.6f}")

# Message token counting
messages = [
    {"role": "system", "content": "You are a coding assistant." * 10},
    {"role": "user", "content": "Write a fibonacci function in Python."},
]
msg_tokens = count_message_tokens(messages)
print(f"\nChat messages: {msg_tokens} tokens")
print(f"Context budget remaining: {128000 - msg_tokens} tokens "
      f"(GPT-4o 128K window)")
The tiktoken library (maintained by OpenAI) gives you exact token counts for OpenAI models. For Anthropic models, their Python SDK includes token counting as well. Knowing your token count before sending the request is essential for two reasons: staying within the model's context window limit, and predicting costs accurately.
Why does this matter so much? Because at production scale, even a 10% reduction in average prompt length translates into real money. I've seen teams cut their LLM costs in half just by trimming unnecessary context from their system prompts and being more deliberate about what gets included in each request.
Putting it together
The tools we've covered today form the practical toolkit for working with LLMs in real applications. We went from raw API calls to streaming, function calling, structured outputs, error handling, cost optimization, and multi-provider abstraction. Every production LLM application uses some combination of these patterns.
What we haven't covered yet (and what's coming) is the next level up: using these API primitives to build agents -- systems that can plan, use tools, maintain state, and accomplish complex tasks autonomously. The function calling and tool use patterns we built today are the foundation for that. And the RAG pipelines from episodes #64 and #65 provide the knowledge layer. Put them together and you get systems that can reason over your data AND take actions based on what they find.
The bottom line
- Major LLM providers (OpenAI, Anthropic, Google) all offer similar chat completion APIs with system/user/assistant message roles. The format has largely standardized;
- Multi-turn conversations require sending the full message history with each request -- the model is stateless;
- Streaming delivers tokens incrementally for better UX -- same output, delivered progressively. Essential for interactive applications;
- Function calling lets models generate structured API calls that YOUR code executes -- the model requests, you decide. This is the foundation for AI agents;
- Structured outputs (JSON mode) guarantee valid JSON responses for programmatic consumption. Stop parsing free-form text with regex;
- Exponential backoff handles rate limits and transient errors gracefully. Add jitter to prevent thundering herds;
- Cost optimization starts with model selection -- the right-sized model for each task often reduces costs 10-20x;
- A provider-agnostic client abstraction enables switching, A/B testing, and automatic fallback between providers.
Exercises
Exercise 1: Build a token budget manager. Using tiktoken, write a class TokenBudget that takes a model name and max context size (default 128K for GPT-4o). It should have methods: add_system(text) to set the system prompt, add_message(role, content) to add a message, tokens_used() to return total tokens so far, tokens_remaining() to return available space, and trim_to_fit(reserve=1000) that removes the oldest non-system messages until there are at least reserve tokens free for the response. Test it by adding a system prompt plus 20 user/assistant message pairs (simulate a long conversation), then calling trim_to_fit(reserve=2000). Print the token count before and after trimming, and how many messages were removed.
Exercise 2: Write a multi-provider cost comparison tool. Define at least 3 provider-model combinations with their per-token pricing (use real published prices). Create a function that takes a list of 10 sample prompts of varying lengths (short classification, medium Q&A, long document summarization), estimates the token count for each, and produces a comparison table showing the cost per prompt and total cost across all 10 prompts for each provider-model combination. Include both input and output token costs (assume output is 30% of input length). Print the table and identify which model is cheapest overall and which is cheapest per quality tier (budget, standard, premium).
Exercise 3: Implement a retry wrapper with telemetry. Write a function call_with_retry(fn, max_retries=3) that wraps any callable with exponential backoff, jitter (random 0-0.5s added to each wait), and detailed telemetry tracking. The wrapper should record: number of attempts, total wall-clock time, each error encountered (type and message), and whether the final result was a success or failure. Simulate API failures by writing a FlakyAPI class whose call() method fails with RateLimitError 40% of the time and TimeoutError 20% of the time (use random.random() to decide). Run 50 calls through the retry wrapper and print a summary report: success rate, average attempts per successful call, average latency, and a breakdown of error types encountered.