Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
What will I learn
- You will learn what "agents" actually means beyond the marketing hype;
- the agent loop: observe, think, act, observe;
- tool use -- LLMs that take actions in the world;
- ReAct: reasoning and acting in interleaved steps;
- planning: breaking complex tasks into manageable steps;
- memory: conversation history, summarization, and working memory;
- failure modes and how to design around them.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations (this post)
Solutions to Episode #66 Exercises
Exercise 1: Multi-provider API client with unified interface, retry logic, and automatic failover.
import time
import json
import random


class LLMProvider:
    """Base class for LLM providers."""

    def __init__(self, name, api_key, model, base_url):
        self.name = name
        self.api_key = api_key
        self.model = model
        self.base_url = base_url
        self.failures = 0
        self.total_calls = 0
        self.total_tokens = 0

    def chat(self, messages, temperature=0.7, max_tokens=1000):
        raise NotImplementedError


class MockOpenAI(LLMProvider):
    def __init__(self, api_key="sk-fake"):
        super().__init__("openai", api_key, "gpt-4o-mini",
                         "https://api.openai.com/v1")

    def chat(self, messages, temperature=0.7, max_tokens=1000):
        self.total_calls += 1
        if random.random() < 0.15:
            self.failures += 1
            raise ConnectionError(f"{self.name}: simulated API timeout")
        tokens = random.randint(50, 200)
        self.total_tokens += tokens
        return {
            "content": f"[{self.name}] Response to: "
                       f"{messages[-1]['content'][:50]}",
            "model": self.model, "tokens": tokens,
        }


class MockAnthropic(LLMProvider):
    def __init__(self, api_key="sk-ant-fake"):
        super().__init__("anthropic", api_key, "claude-sonnet-4-20250514",
                         "https://api.anthropic.com/v1")

    def chat(self, messages, temperature=0.7, max_tokens=1000):
        self.total_calls += 1
        if random.random() < 0.1:
            self.failures += 1
            raise ConnectionError(f"{self.name}: simulated rate limit")
        tokens = random.randint(60, 220)
        self.total_tokens += tokens
        return {
            "content": f"[{self.name}] Response to: "
                       f"{messages[-1]['content'][:50]}",
            "model": self.model, "tokens": tokens,
        }


class MultiProviderClient:
    def __init__(self, providers, max_retries=3, retry_delay=0.5):
        self.providers = providers
        self.primary = providers[0]
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.failover_log = []

    def chat(self, messages, **kwargs):
        # Retry the primary provider first, waiting a bit longer each time
        for attempt in range(self.max_retries):
            try:
                return self.primary.chat(messages, **kwargs)
            except Exception as e:
                print(f" Primary ({self.primary.name}) attempt "
                      f"{attempt+1} failed: {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self.retry_delay * (attempt + 1))
        # Primary exhausted: try each fallback provider once
        for provider in self.providers[1:]:
            try:
                result = provider.chat(messages, **kwargs)
                self.failover_log.append({
                    "from": self.primary.name,
                    "to": provider.name,
                })
                print(f" Failover: {self.primary.name} -> {provider.name}")
                return result
            except Exception as e:
                print(f" Failover {provider.name} also failed: {e}")
        raise RuntimeError("All providers failed")

    def stats(self):
        print(f"\n{'Provider':<15} {'Calls':>6} {'Failures':>9} "
              f"{'Tokens':>7} {'Fail%':>6}")
        print("-" * 48)
        for p in self.providers:
            fail_pct = (p.failures / p.total_calls * 100
                        if p.total_calls > 0 else 0)
            print(f"{p.name:<15} {p.total_calls:>6} {p.failures:>9} "
                  f"{p.total_tokens:>7} {fail_pct:>5.1f}%")
        print(f"Failovers: {len(self.failover_log)}")


random.seed(42)
client = MultiProviderClient([MockOpenAI(), MockAnthropic()])
msgs = [{"role": "user", "content": "Explain gradient descent"}]
for i in range(20):
    try:
        result = client.chat(msgs)
        print(f"Query {i+1}: {result['content'][:60]}...")
    except RuntimeError as e:
        print(f"Query {i+1}: TOTAL FAILURE - {e}")
client.stats()
The retry + failover pattern is what real production systems use. Your primary provider handles 85-95% of requests. When it goes down (and it will go down), the failover kicks in transparently. The stats tracking lets you monitor which providers are flaky and adjust accordingly.
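The solution above waits a little longer after each failed attempt (a linear delay). A common variation is exponential backoff with random jitter, so that many clients retrying at once don't all hit the provider at the same moment. Here is a minimal sketch of that idea; `backoff_delays`, `base` and `cap` are names I made up for this illustration, and the numbers are arbitrary rather than tuned for any real provider.

```python
import random


def backoff_delays(max_retries, base=0.5, cap=8.0, rng=None):
    """Return one sleep duration per retry: exponential growth, full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Ceiling doubles each attempt (0.5, 1, 2, 4, ...) up to the cap
        ceiling = min(cap, base * (2 ** attempt))
        # Pick a random point below the ceiling to spread retries out
        delays.append(rng.uniform(0, ceiling))
    return delays


for attempt, delay in enumerate(backoff_delays(5, rng=random.Random(42)), 1):
    print(f"retry {attempt}: sleep {delay:.2f}s")
```

In the client above you could swap the `time.sleep(self.retry_delay * (attempt + 1))` line for a lookup into such a list, but whether that is worth it depends on how many clients share the same provider.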
Exercise 2: Structured output parser with JSON mode comparison.
import json
import re

raw_responses = [
    '{"name": "Python", "year": 1991, "creator": "Guido van Rossum", '
    '"paradigms": ["OOP", "functional", "procedural"], "typed": false}',
    '```json\n{"name": "Rust", "year": 2010, "creator": "Graydon Hoare", '
    '"paradigms": ["systems", "functional"], "typed": true}\n```',
    'Here is the info:\n{"name": "Go", "year": 2009, '
    '"creator": "Rob Pike", "paradigms": ["concurrent"], "typed": true}'
    '\nHope that helps!',
    '{"name": "JavaScript", "year": 1995, "creator": "Brendan Eich"',
    'I cannot provide that information.',
]


def extract_json(text):
    text = text.strip()
    # Attempt 1: the whole response is valid JSON
    try:
        return json.loads(text), "direct"
    except json.JSONDecodeError:
        pass
    # Attempt 2: JSON wrapped in a markdown code fence
    fence_match = re.search(r'```(?:json)?\s*\n?(.*?)\n?```',
                            text, re.DOTALL)
    if fence_match:
        try:
            return json.loads(fence_match.group(1)), "fence"
        except json.JSONDecodeError:
            pass
    # Attempt 3: grab everything from the first "{" to the last "}"
    brace_match = re.search(r'\{.*\}', text, re.DOTALL)
    if brace_match:
        try:
            return json.loads(brace_match.group()), "extracted"
        except json.JSONDecodeError:
            pass
    return None, "failed"


def validate_schema(data):
    required = {"name": str, "year": int, "creator": str, "paradigms": list}
    errors = []
    if not isinstance(data, dict):
        return ["Not a dictionary"]
    for field, expected_type in required.items():
        if field not in data:
            errors.append(f"Missing: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"{field}: wrong type")
    return errors


print(f"{'#':<4} {'Method':<12} {'Valid':>6} {'Result'}")
print("-" * 50)
for i, resp in enumerate(raw_responses):
    parsed, method = extract_json(resp)
    if parsed is None:
        print(f"{i+1:<4} {method:<12} {'N/A':>6} Could not parse")
    else:
        errors = validate_schema(parsed)
        valid = "Yes" if not errors else "No"
        detail = parsed.get("name", "?") if not errors else "; ".join(errors)
        print(f"{i+1:<4} {method:<12} {valid:>6} {detail}")
Structured output parsing is one of those things that looks trivial until you've dealt with 50 different ways LLMs format their JSON. The fence extraction and brace-matching fallbacks catch most real-world cases.
Exercise 3: Streaming response handler with token counting and performance metrics.
import time
import random


def simulate_stream(text, chunk_size=3, delay=0.02):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        time.sleep(delay)
        yield {"delta": chunk, "finish_reason": None}
    yield {"delta": "", "finish_reason": "stop"}


class StreamHandler:
    def __init__(self):
        self.chunks = []
        self.token_count = 0
        self.start_time = None
        self.first_token_time = None
        self.end_time = None

    def process_stream(self, stream):
        self.start_time = time.time()
        full_text = ""
        for event in stream:
            if event["delta"]:
                # Record when the first visible chunk arrives (TTFT)
                if self.first_token_time is None:
                    self.first_token_time = time.time()
                self.chunks.append(event["delta"])
                # Whitespace-separated words stand in for tokens here
                self.token_count += len(event["delta"].split())
                full_text += (" " if full_text else "") + event["delta"]
                print(event["delta"], end=" ", flush=True)
            if event["finish_reason"] == "stop":
                self.end_time = time.time()
        print()
        return full_text

    def metrics(self):
        total = self.end_time - self.start_time
        ttft = (self.first_token_time - self.start_time
                if self.first_token_time else 0)
        tps = self.token_count / total if total > 0 else 0
        return {
            "total_time": f"{total:.3f}s",
            "time_to_first_token": f"{ttft:.3f}s",
            "tokens": self.token_count,
            "tokens_per_second": f"{tps:.1f}",
        }


response_text = (
    "Gradient descent is an optimization algorithm that iteratively "
    "adjusts parameters by moving in the direction of steepest "
    "descent of the loss function. The learning rate controls the "
    "step size. Adam combines momentum with adaptive learning rates."
)
handler = StreamHandler()
print("Streaming:")
handler.process_stream(simulate_stream(response_text))
print("\nMetrics:")
for k, v in handler.metrics().items():
    print(f" {k}: {v}")
Time-to-first-token (TTFT) is the metric users actually feel. A response that starts streaming in 200ms feels snappy even if the full response takes 3 seconds. Streaming gives users something to read while the model is still thinking.
On to today's episode
Here we go! In episode #66 we got our hands on the actual APIs that power language models -- sending messages, handling responses, streaming tokens, structuring output. All of that was about talking to an LLM. Today we go a step further: what happens when the LLM talks back to the world?
The word "agent" is probably the most overloaded term in AI right now. Every chatbot wrapper slaps "agentic" on its landing page. Every startup with a for-loop around an API call claims agentic capabilities. Marketing departments have turned it into meaningless buzzword soup. So let's cut through that noise and actually build something real ;-)
An AI agent is a system where an LLM decides what actions to take, executes them, observes the results, and decides what to do next -- in a loop, until the task is complete. The key difference from a regular LLM call: the model doesn't just generate text. It takes actions that change state in the world (reading files, running code, searching the web, calling APIs), and it adapts its behavior based on what happens.
A chatbot answers questions. An agent completes tasks. That's the distinction, and it's more fundamental than it might sound.
The agent loop
Every agent, from the simplest script to the most sophisticated multi-step autonomous system, follows the same core loop:
- Observe: receive input (user request, tool output, environment state)
- Think: reason about what to do next (the LLM generates a plan or picks the next action)
- Act: execute an action (call a function, run code, query a database)
- Observe: receive the result of that action
- Repeat until the task is complete or a stopping condition is met
If this reminds you of reinforcement learning (which we'll cover later in the series), that's not a coincidence. The observe-think-act loop is a fundamental pattern in intelligent systems. The difference is that RL agents learn their policy through trial and error over thousands of episodes, while LLM agents arrive with a pre-trained "policy" (the model's weights) and operate in-context -- no gradient updates needed, just prompt engineering and tool definitions.
import json


class SimpleAgent:
    """Minimal agent: LLM + tools + loop."""

    def __init__(self, tools, system_prompt="You are a helpful assistant."):
        self.tools = {t['name']: t for t in tools}
        self.system = system_prompt
        self.messages = [{"role": "system", "content": system_prompt}]
        self.max_steps = 10
        self.step_log = []

    def run(self, user_request):
        """Run the agent loop until completion or max steps."""
        self.messages.append({"role": "user", "content": user_request})
        for step in range(self.max_steps):
            # Think: ask the LLM what to do next
            response = self._call_llm(self.messages)
            # Check if the LLM wants to use a tool
            if response.get("tool_call"):
                call = response["tool_call"]
                tool_name = call["name"]
                tool_args = call["arguments"]
                # Act: execute the tool
                print(f" Step {step+1}: calling {tool_name}({tool_args})")
                result = self._execute_tool(tool_name, tool_args)
                # Observe: feed result back into context
                self.messages.append({
                    "role": "tool",
                    "name": tool_name,
                    "content": str(result),
                })
                self.step_log.append({
                    "step": step + 1,
                    "tool": tool_name,
                    "args": tool_args,
                    "result": str(result)[:200],
                })
            else:
                # No tool call = agent is done
                print(f" Completed in {step+1} steps")
                return response["content"]
        return "Max steps reached without completing the task."

    def _call_llm(self, messages):
        """Simulated LLM call -- in practice, use OpenAI/Anthropic API."""
        last_msg = messages[-1]
        content = last_msg.get("content", "")
        if last_msg["role"] == "user" and "calculate" in content.lower():
            return {
                "tool_call": {
                    "name": "calculator",
                    "arguments": {"expression": "42 * 17"},
                }
            }
        elif last_msg["role"] == "tool":
            return {"content": f"Based on the tool result: {content}"}
        return {"content": "I can help with that!"}

    def _execute_tool(self, name, args):
        """Execute a tool by name with given arguments."""
        if name not in self.tools:
            return f"Error: unknown tool '{name}'"
        func = self.tools[name]["function"]
        return func(**args)


# Define some tools
def calculator(expression):
    """Evaluate a math expression safely."""
    allowed = set("0123456789+-*/.() ")
    if not all(c in allowed for c in expression):
        return {"error": "Invalid characters in expression"}
    try:
        # Character whitelist plus empty builtins keeps eval to arithmetic
        return {"result": eval(expression, {"__builtins__": {}}, {})}
    except (SyntaxError, ZeroDivisionError, TypeError, OverflowError) as exc:
        # Report the problem to the agent instead of crashing the loop
        return {"error": f"{type(exc).__name__}: {exc}"}


def lookup_fact(topic):
    """Look up a fact about a topic."""
    facts = {
        "python": "Python was created by Guido van Rossum in 1991.",
        "transformer": "The transformer was introduced in 2017.",
    }
    return facts.get(topic.lower(), f"No fact found for '{topic}'")


tools = [
    {"name": "calculator", "function": calculator,
     "description": "Evaluate a math expression",
     "parameters": {"expression": "string"}},
    {"name": "lookup_fact", "function": lookup_fact,
     "description": "Look up a fact about a topic",
     "parameters": {"topic": "string"}},
]
agent = SimpleAgent(tools)
result = agent.run("Please calculate 42 * 17 for me")
print(f"Final answer: {result}")
print(f"\nStep log: {json.dumps(agent.step_log, indent=2)}")
The max_steps limit is crucial. Without it, a confused agent loops forever -- calling the same tool repeatedly, making the same mistake over and over, or trying variations that never converge. In practice, most real tasks complete in 3-7 steps. If an agent needs more than 10, something is probably wrong with either the task description, the tool definitions, or both.
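A step limit catches the runaway case, but you can often spot the "same tool, same arguments, again" pattern earlier. Below is a rough sketch of such a check. `RepeatGuard` and `max_repeats` are hypothetical names for this illustration (they are not part of the SimpleAgent class above), and it assumes tool arguments are JSON-serializable dicts.

```python
import json
from collections import Counter


class RepeatGuard:
    """Flags a tool call once it has been made more than max_repeats times."""

    def __init__(self, max_repeats=2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def should_block(self, tool_name, tool_args):
        # Serialize with sorted keys so argument order does not matter
        key = (tool_name, json.dumps(tool_args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats


guard = RepeatGuard(max_repeats=2)
calls = [("calculator", {"expression": "42 * 17"})] * 3
decisions = [guard.should_block(name, args) for name, args in calls]
print(decisions)  # third identical call is flagged
```

When the guard fires, one option is to skip the tool and append a message telling the model it already has that result, so it gets a chance to change course instead of burning the remaining steps.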
Tool use -- the hands of the agent
An LLM without tools is like a brain without a body -- it can think but can't interact with anything outside its context window. Tools give the agent actual capabilities. And the way you define those tools matters quite a lot more than you might expect.
import subprocess
import os
import sys


def search_web(query):
    """Search the web and return top results."""
    # In production, use Serper, Tavily, Brave Search API, etc.
    mock_results = {
        "python release": [
            {"title": "Python 3.13 Released",
             "snippet": "New features include..."},
        ],
        "default": [
            {"title": "Search result",
             "snippet": f"Results for: {query}"},
        ]
    }
    for key, results in mock_results.items():
        if key in query.lower():
            return results
    return mock_results["default"]


def run_python(code):
    """Execute Python code in a subprocess and return output."""
    try:
        result = subprocess.run(
            # sys.executable also works where no "python3" command exists
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=10
        )
    except subprocess.TimeoutExpired:
        # Return the error instead of raising so the agent can react to it
        # (returncode -1 is just a placeholder value for "did not finish")
        return {"stdout": "", "stderr": "Timed out after 10 seconds",
                "returncode": -1}
    return {
        "stdout": result.stdout,
        "stderr": result.stderr,
        "returncode": result.returncode,
    }


def read_file(path):
    """Read a file and return its contents."""
    if not os.path.exists(path):
        return f"Error: file '{path}' not found"
    with open(path, 'r') as f:
        content = f.read()
    return content[:5000]  # truncate long files


# Tool definitions in OpenAI function calling format
tool_definitions = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for current information. "
                           "Use when you need facts you don't know.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code. Use for calculations, "
                           "data processing, or testing code snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute"
                    }
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file on disk.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Path to the file to read"
                    }
                },
                "required": ["path"]
            }
        }
    },
]
for td in tool_definitions:
    func = td["function"]
    print(f"Tool: {func['name']}")
    print(f" Desc: {func['description'][:65]}...")
    params = func["parameters"]["properties"]
    for pname, pinfo in params.items():
        print(f" Param: {pname} ({pinfo['type']})")
    print()
Tool design principles -- this matters more than the agent logic itself, and I want to emphasize that because people tend to spend all their time on the "agent framework" and almost no time on the tools:
Clear, specific descriptions: "Search the web for current information. Use when you need facts you don't know." tells the model when to use the tool, not just what it does. Vague descriptions like "search stuff" lead to the model calling the wrong tool or not calling it when it should.
Well-defined parameters: include types, descriptions, and which parameters are required. The model generates the function call arguments based entirely on the parameter schema -- that schema IS the model's understanding of how to use the tool. Bad schemas = bad arguments = failed tool calls.
Predictable output format: return structured data (dicts with known keys), not free-form strings. The model processes the tool result in its next step -- structured output is easier for it to reason about than a wall of unformatted text.
Error handling: return error messages, don't crash. If a tool throws an exception, the agent can't recover gracefully. Return {"error": "file not found"} and the model can try a different path or ask the user for clarification.
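To make the last two principles concrete, here is a small sketch of a decorator that turns any exception into a structured error result. The names `tool_safe` and `divide`, and the `ok` / `result` / `error` keys, are my own choices for this example and not a standard; pick whatever shape you like as long as every tool uses the same one.

```python
import functools


def tool_safe(func):
    """Wrap a tool so exceptions come back as {"error": ...} dicts."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "result": func(*args, **kwargs)}
        except Exception as exc:
            # Name the exception type so the model can tell failures apart
            return {"ok": False,
                    "error": f"{type(exc).__name__}: {exc}"}
    return wrapper


@tool_safe
def divide(a, b):
    """Hypothetical tool used only for this demonstration."""
    return a / b


print(divide(10, 4))
print(divide(1, 0))
```

Both calls return a dict with the same keys, so the agent loop never has to guess whether a tool result is a value, a string, or a traceback.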
And here's the security angle that people often skip over: an agent with a run_python tool can execute arbitrary code. An agent with write_file can overwrite system files. In a learning environment that's fine, but in production you NEED sandboxing -- restrict file system access to a specific directory, limit network calls, set execution timeouts, validate tool arguments before execution. The agent doesn't have malicious intent, but a confused agent running rm -rf / because it misunderstood the task is not something you want happening at 3 AM ;-)
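As one small piece of that sandboxing story, here is a sketch of the "restrict file system access to a specific directory" check. `resolve_in_sandbox` is a hypothetical helper name, and this covers path validation only; it is not a full sandbox (no process isolation, no network limits).

```python
import tempfile
from pathlib import Path


def resolve_in_sandbox(sandbox_root, user_path):
    """Return the resolved path if it stays inside sandbox_root, else None."""
    root = Path(sandbox_root).resolve()
    # resolve() collapses ".." segments and follows symlinks
    candidate = (root / user_path).resolve()
    # Path.is_relative_to is available from Python 3.9 onwards
    if candidate.is_relative_to(root):
        return candidate
    return None


with tempfile.TemporaryDirectory() as sandbox:
    inside = resolve_in_sandbox(sandbox, "notes/report.txt")
    escaped = resolve_in_sandbox(sandbox, "../../etc/passwd")
    absolute = resolve_in_sandbox(sandbox, "/etc/passwd")
    print(inside is not None, escaped is None, absolute is None)
```

A file tool would call this first and return an error dict when it gets None back, instead of ever opening a path outside the sandbox directory.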
ReAct -- reasoning and acting together
The ReAct pattern (Yao et al., 2022) is one of those ideas that seems obvious in retrospect but made a real difference in practice. Instead of just deciding which tool to call, the model first explains its thinking, then acts, then reflects on the result. Thought, Action, Observation -- in a loop.
Question: What is the population of the capital of France?
Thought: I need to find the capital of France first, then look up its
population. I know the capital is Paris, but I should verify the
population figure since I want an accurate number.
Action: search_web("Paris population 2024")
Observation: Paris has a population of approximately 2.1 million in
the city proper and 12.3 million in the metropolitan area.
Thought: I found the answer. The question asks about "the population"
without specifying city proper vs metro, so I should provide both.
Action: finish("The capital of France is Paris. Its population is
approximately 2.1 million (city proper) or 12.3 million
(metropolitan area).")
Why does explicitly writing out the thinking step help? Because without it, the model tends to jump straight to action and makes mistakes -- searching for the wrong thing, calling the wrong tool, or missing a step in multi-step reasoning. The "Thought" step forces structured deliberation. It's essentially chain-of-thought prompting (which we touched on in episode #62) applied to the action loop.
REACT_SYSTEM_PROMPT = """You solve tasks by thinking step by step
and using tools when needed.

For each step, follow this format:
Thought: reason about what you know and what you need to do next
Action: use one of the available tools
Observation: (provided by the system after tool execution)

Available tools:
{tool_descriptions}

When you have enough information to answer, respond with:
Thought: I now have all the information needed.
Action: finish(your final answer here)

Rules:
- Always think before acting
- Use tools only when you need information you don't have
- If a tool returns an error, try a different approach
- Never make up facts -- use tools to verify"""


def format_tool_descriptions(tools):
    """Format tool list for the system prompt."""
    lines = []
    for t in tools:
        params = ", ".join(f"{k}: {v}"
                           for k, v in t.get("parameters", {}).items())
        lines.append(f"- {t['name']}({params}): {t['description']}")
    return "\n".join(lines)


tools = [
    {"name": "search_web", "description": "Search the web",
     "parameters": {"query": "string"}},
    {"name": "calculator", "description": "Evaluate math expressions",
     "parameters": {"expression": "string"}},
    {"name": "read_file", "description": "Read a file",
     "parameters": {"path": "string"}},
]
prompt = REACT_SYSTEM_PROMPT.format(
    tool_descriptions=format_tool_descriptions(tools))
print(prompt)
ReAct is just a prompt pattern -- you don't need a special framework or library to use it. Add the format instructions to your system prompt, and the model will (usually) follow the Thought/Action/Observation structure. The thinking traces also make the agent's behavior interpretable -- you can see why it made each decision, which is incredibly useful for debugging. When an agent does something wrong, reading its thought trace tells you exactly where the reasoning went off the rails.
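Since ReAct output is plain text, your loop has to pull the tool call back out of it. Here is a minimal sketch of that parsing step, assuming the model sticks to the Thought/Action format and each action takes a single argument. `parse_react_step` is a name I picked for this example, and real model output will be messier than this regex expects.

```python
import re


def parse_react_step(text):
    """Split one model turn into its thought, tool name, and raw argument."""
    # Thought runs until the Action line (or the end of the text)
    thought = re.search(r"Thought:\s*(.*?)(?=\nAction:|\Z)", text, re.DOTALL)
    # Action looks like tool_name(argument); the argument may span lines
    action = re.search(r"Action:\s*(\w+)\((.*)\)", text, re.DOTALL)
    return {
        "thought": thought.group(1).strip() if thought else None,
        "tool": action.group(1) if action else None,
        # Strip surrounding whitespace and quotes from the single argument
        "argument": action.group(2).strip().strip("\"'") if action else None,
    }


sample = (
    "Thought: I need the population of Paris.\n"
    'Action: search_web("Paris population 2024")'
)
step = parse_react_step(sample)
print(step)
```

If `tool` comes back as None, the model broke the format, and the usual move is to feed that back as an Observation and let it try again. A tool name of `finish` means the argument is the final answer.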
Planning -- breaking down complex tasks
Simple tasks need one or two tool calls. "What's the weather in Amsterdam?" -- one search, one answer. But complex tasks require decomposition. "Write a report comparing the performance of three sorting algorithms on various input sizes" requires: decide which algorithms to compare, write benchmark code, run benchmarks, collect results, analyze the data, write the actual report with tables and conclusions.
The simplest planning approach is the plan-then-execute pattern. Ask the LLM to create a plan before doing anything, then execute each step:
def plan_and_execute(task, tools):
    """Two-phase agent: plan first, then execute."""
    # Phase 1: Create a plan (no tool use yet)
    plan_prompt = f"""Create a step-by-step plan for this task.
Each step should be concrete and actionable.
Do NOT execute anything yet -- just plan.
Task: {task}
Output your plan as a numbered list."""
    # Simulated plan (in practice, this comes from the LLM)
    plan = [
        "Search for the 3 most commonly benchmarked sorting algorithms",
        "Write Python code implementing each algorithm",
        "Create a benchmark harness that tests each on varying sizes",
        "Run benchmarks and collect timing data",
        "Format results into a comparison table",
        "Write a summary of findings",
    ]
    print(f"Plan for: {task}")
    for i, step in enumerate(plan, 1):
        print(f" {i}. {step}")
    # Phase 2: Execute each step
    results = []
    for i, step in enumerate(plan, 1):
        print(f"\nExecuting step {i}: {step}")
        result = f"Completed: {step}"
        results.append({"step": i, "description": step, "result": result})
        print(f" Result: {result}")
    # Phase 3: Synthesize
    print(f"\nAll {len(results)} steps completed.")
    return results


results = plan_and_execute(
    "Compare bubble sort, merge sort, and quicksort performance",
    tools=[]
)
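In the example above the plan is hard-coded. With a real model you get the numbered list back as text and have to turn it into steps yourself. A rough sketch of that, where `parse_plan` and the sample output are made up for illustration:

```python
import re


def parse_plan(llm_output):
    """Extract the step text from lines like '1. do X' or '2) do Y'."""
    steps = []
    for line in llm_output.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s+(.+)", line)
        if match:
            steps.append(match.group(2).strip())
    return steps


sample_output = """Here is my plan:
1. Pick three sorting algorithms
2) Write a benchmark harness
3. Run the benchmarks and collect timings
Let me know if you want changes."""

plan = parse_plan(sample_output)
for i, step in enumerate(plan, 1):
    print(f"{i}. {step}")
```

Lines that don't start with a number are ignored, so the chatty intro and outro that models like to add don't end up as plan steps. An empty list is a signal to re-prompt rather than to execute nothing.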
The plan-then-execute pattern works well for tasks where you can anticipate all steps upfront. But real-world tasks are messy -- step 3 might reveal that you need a step 2.5 that wasn't in the original plan. For that, you need iterative planning:
import random


class IterativePlanner:
    """Agent that replans after each step based on results."""

    def __init__(self, max_replans=5):
        self.max_replans = max_replans
        self.execution_log = []

    def run(self, task):
        current_plan = self._make_plan(task)
        for replan_count in range(self.max_replans):
            if not current_plan:
                print("Plan complete -- no remaining steps.")
                break
            step = current_plan[0]
            print(f"\n[Iteration {replan_count}] Executing: {step}")
            result = self._execute_step(step)
            self.execution_log.append({"step": step, "result": result})
            # Replan based on what happened
            remaining = current_plan[1:]
            current_plan = self._replan(
                task, self.execution_log, remaining)
            print(f" Remaining plan ({len(current_plan)} steps): "
                  f"{current_plan[:2]}...")
        return self.execution_log

    def _make_plan(self, task):
        # Simulated initial plan (in practice, this comes from the LLM)
        return [
            "Research current best practices",
            "Implement core functionality",
            "Test the implementation",
            "Handle edge cases",
            "Document the solution",
        ]

    def _replan(self, task, log, remaining):
        # Insert a fix step in front of the remaining plan after a failure
        last = log[-1]
        if "error" in last["result"].lower():
            return ["Fix the error from previous step"] + remaining
        return remaining

    def _execute_step(self, step):
        # Simulated execution: roughly one step in five fails
        if random.random() < 0.2:
            return "Error: unexpected format in input data"
        return f"Success: {step} completed"


random.seed(42)
planner = IterativePlanner()
results = planner.run("Build a data analysis pipeline")
print(f"\nTotal steps executed: {len(results)}")
The difference: plan-then-execute assumes the plan is correct upfront. Iterative planning adapts. If step 2 reveals unexpected data, the plan adjusts. If step 3 fails, a fix step gets inserted. This is closer to how humans actually work -- you have a rough plan, start executing, and revise constantly based on what you learn along the way.
More advanced planning approaches exist: hierarchical planning (break task into phases, break phases into steps), planning with self-critique (generate a plan, then ask the model to find flaws in its own plan before executing). We'll explore these in part 2.
Memory -- what the agent remembers
An agent's memory is its message history. Every user message, every thought, every tool call, every tool result -- it all accumulates in the conversation context. And this creates two very real problems as the agent works longer:
Context overflow: after 20+ tool calls with verbose results, the conversation history might exceed the context window entirely. The agent literally can't remember its early actions. It re-does work, contradicts itself, or loses track of the original goal.
Context dilution: even within the window, important information from early steps gets buried under pages of tool results and intermediate reasoning. Remember the "lost in the middle" phenomenon we discussed in episode #64? Same problem here -- the model pays attention to the beginning and end of the context but glosses over the middle, which is exactly where your critical intermediate results live.
class AgentMemory:
    """Memory system with summarization and scratchpad."""

    def __init__(self, max_messages=20):
        self.messages = []
        self.max_messages = max_messages
        self.summary = ""
        self.scratchpad = {}  # persistent key-value storage
        self.compression_count = 0

    def add(self, role, content, name=None):
        msg = {"role": role, "content": content}
        if name:
            msg["name"] = name
        self.messages.append(msg)
        if len(self.messages) > self.max_messages:
            self._compress()

    def _compress(self):
        # Fold the older half of the history into the running summary
        split = len(self.messages) // 2
        old_messages = self.messages[:split]
        old_text = []
        for m in old_messages:
            role = m["role"]
            content = m["content"][:150]
            if m.get("name"):
                old_text.append(f"[{role}/{m['name']}] {content}")
            else:
                old_text.append(f"[{role}] {content}")
        new_summary = "Previous actions:\n" + "\n".join(old_text)
        if self.summary:
            self.summary = f"{self.summary}\n\n{new_summary}"
        else:
            self.summary = new_summary
        self.messages = self.messages[split:]
        self.compression_count += 1
        print(f" [Memory compressed: kept {len(self.messages)} msgs, "
              f"summarized {split}]")

    def note(self, key, value):
        """Write to scratchpad (persists across compressions)."""
        self.scratchpad[key] = value

    def recall(self, key):
        return self.scratchpad.get(key, None)

    def get_context(self):
        ctx = []
        if self.summary:
            ctx.append({"role": "system",
                        "content": f"Context:\n{self.summary}"})
        ctx.extend(self.messages)
        return ctx

    def stats(self):
        return {
            "active_messages": len(self.messages),
            "compressions": self.compression_count,
            "scratchpad_keys": list(self.scratchpad.keys()),
            "summary_length": len(self.summary),
        }


# Simulate a long-running agent session
memory = AgentMemory(max_messages=8)
steps = [
    ("user", "Analyze the sales data for Q3 2024"),
    ("assistant", "I'll start by reading the sales data file."),
    ("tool", "read_file: 1500 rows, columns: date, product, revenue"),
    ("assistant", "Found 1500 rows. Calculating totals by region."),
    ("tool", "run_python: North $2.3M, South $1.8M, East $2.1M, West $1.5M"),
    ("assistant", "Regional totals done. Finding top products now."),
    ("tool", "run_python: Top 3: Widget Pro $1.2M, Gadget X $0.9M"),
    ("assistant", "Top products identified. Comparing with Q2."),
    ("tool", "read_file: Q2 totals - North $2.0M, South $1.9M"),
    ("assistant", "Q2 data loaded. Calculating growth rates."),
    ("tool", "run_python: Growth - North +15%, South -5.3%, East +16.7%"),
    ("assistant", "Growth analysis complete. Writing final report."),
]
# Store key findings in scratchpad
memory.note("q3_totals", "N:$2.3M, S:$1.8M, E:$2.1M, W:$1.5M")
memory.note("growth", "N:+15%, E:+16.7%, S:-5.3%, W:-6.25%")
for role, content in steps:
    memory.add(role, content)
print(f"\nMemory stats: {memory.stats()}")
print(f"\nScratchpad (survives compression):")
for k, v in memory.scratchpad.items():
    print(f" {k}: {v}")
The scratchpad is the key insight here. Summaries are lossy -- you can't perfectly compress 10 detailed tool results into a 3-sentence summary without losing information. But explicit notes written by the agent ("Q3 North revenue: $2.3M") survive compressions intact. It's like the difference between remembering a meeting vs writing down the action items. The notes might not capture the nuance of the discussion, but they preserve the facts you actually need.
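One detail worth spelling out: in the AgentMemory class above, get_context() returns the summary and the recent messages, but not the scratchpad. The notes only help if something puts them back in front of the model. A minimal sketch of one way to do that follows. The helper name build_context and the "Notes:" layout are my own assumptions for illustration, not part of the class above:

```python
def build_context(summary, scratchpad, messages):
    """Assemble the context: summary first, then notes, then recent messages.

    Hypothetical helper for illustration; the "Notes:" layout is an assumption.
    """
    parts = []
    if summary:
        parts.append(f"Context:\n{summary}")
    if scratchpad:
        # Render each scratchpad entry as one "- key: value" line
        notes = "\n".join(f"- {k}: {v}" for k, v in scratchpad.items())
        parts.append(f"Notes:\n{notes}")
    ctx = []
    if parts:
        ctx.append({"role": "system", "content": "\n\n".join(parts)})
    ctx.extend(messages)
    return ctx


demo_ctx = build_context(
    summary="Previous actions:\n[user] Analyze the sales data for Q3 2024",
    scratchpad={"q3_totals": "N:$2.3M, S:$1.8M, E:$2.1M, W:$1.5M"},
    messages=[{"role": "assistant", "content": "Writing final report."}],
)
for msg in demo_ctx:
    print(f"[{msg['role']}] {msg['content']}")
```

Because the notes are rebuilt from the dictionary on every call, they never pass through the lossy compression step at all.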
Vector memory is another approach worth mentioning: embed past interactions and retrieve relevant ones when needed. If the agent worked on Python code earlier and the user asks about Python again two hours later, the relevant past conversation gets retrieved via embedding similarity (exactly what we built in episode #63) and injected into the current context. This is RAG applied to the agent's own history -- meta-RAG, if you will ;-)
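To make the retrieval idea concrete without pulling in a model, here is a small sketch. Note the big assumption: toy_embed is a plain bag-of-words counter standing in for a real embedding model, so it only matches on shared words, not on meaning. The VectorMemory class and its method names are hypothetical as well; in practice you would pass in a real embedding function instead.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorMemory:
    """Store past interactions and retrieve the most similar ones."""

    def __init__(self, embed_fn=toy_embed):
        self.embed_fn = embed_fn
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, self.embed_fn(text)))

    def search(self, query, top_k=2):
        q = self.embed_fn(query)
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


vmem = VectorMemory()
vmem.add("run_python: fixed the import error in the python sorting script")
vmem.add("read_file: Q3 sales data, 1500 rows, columns date, product, revenue")
vmem.add("search: weather forecast for Amsterdam this weekend")

for score, text in vmem.search("python script import error"):
    print(f"  {score:.2f}  {text}")
```

The retrieved texts would then be injected into the context alongside the summary, the same way retrieved documents are in a regular RAG pipeline.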
When agents go wrong
Agents fail in predictable ways. Understanding these failure modes helps you design around them before they bite you:
Infinite loops: the agent calls the same tool with the same arguments repeatedly, expecting different results. Fix: track tool call history, detect identical calls repeated back to back, and force a different action or terminate after 2-3 repetitions.
class LoopDetector:
    """Detect and break infinite loops in agent execution."""

    def __init__(self, max_repeats=3):
        self.history = []
        self.max_repeats = max_repeats

    def check(self, tool_name, args):
        """Returns True if this call is safe, False if loop detected."""
        call_sig = f"{tool_name}:{args}"
        self.history.append(call_sig)
        recent = self.history[-self.max_repeats:]
        if (len(recent) == self.max_repeats and
                len(set(recent)) == 1):
            return False
        return True

    def reset(self):
        self.history = []


detector = LoopDetector(max_repeats=3)
calls = [
    ("search", "python"),
    ("search", "python"),
    ("search", "python"),  # should trigger
    ("search", "rust"),    # different query, OK
    ("calc", "1+1"),
    ("calc", "1+1"),
]
for tool, args in calls:
    safe = detector.check(tool, args)
    status = "OK" if safe else "LOOP DETECTED"
    print(f" {tool}({args}) -> {status}")
Hallucinated tool calls: the model generates calls to tools that don't exist, or passes arguments that don't match the schema. This happens more often with smaller models or poorly defined tool descriptions. Fix: validate tool names against the registry and validate argument types against the schema before execution. Return a clear error message so the model can self-correct.
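A minimal sketch of that validation step could look like the following. The tool names, the TOOL_SCHEMAS layout and the validate_tool_call function are all hypothetical, and for simplicity every parameter is treated as required. The point is the shape: check the name, check the arguments, and return an error message that is specific enough for the model to fix its own call.

```python
# Hypothetical registry: tool name -> {parameter name: expected Python type}
TOOL_SCHEMAS = {
    "read_file": {"path": str},
    "calculator": {"expression": str},
    "search": {"query": str, "max_results": int},
}


def validate_tool_call(name, args, schemas=TOOL_SCHEMAS):
    """Return (ok, message). The message is meant to be fed back to the model."""
    if name not in schemas:
        known = ", ".join(sorted(schemas))
        return False, f"Unknown tool '{name}'. Available tools: {known}"
    schema = schemas[name]
    # Assumption: every parameter in the schema is required
    missing = [p for p in schema if p not in args]
    if missing:
        return False, f"Tool '{name}' is missing parameters: {missing}"
    unexpected = [p for p in args if p not in schema]
    if unexpected:
        return False, f"Tool '{name}' got unexpected parameters: {unexpected}"
    for param, expected in schema.items():
        if not isinstance(args[param], expected):
            return False, (f"Parameter '{param}' of '{name}' should be "
                           f"{expected.__name__}, got {type(args[param]).__name__}")
    return True, "OK"


test_calls = [
    ("search", {"query": "python agents", "max_results": 5}),
    ("web_browse", {"url": "https://example.com"}),                # no such tool
    ("search", {"query": "python agents"}),                        # missing param
    ("search", {"query": "python agents", "max_results": "five"}), # wrong type
]
for name, args in test_calls:
    ok, message = validate_tool_call(name, args)
    print(f"  {name}: {'PASS' if ok else 'REJECTED'} - {message}")
```

On a rejection, the message goes back into the conversation as the tool result instead of executing anything, which gives the model a chance to retry with a valid call.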
Task drift: the agent starts well but gradually loses focus, pursuing tangents or forgetting the original goal. After 15 steps of debugging a minor import error, it's forgotten that the actual task was "analyze sales data." Fix: include the original task description in a system message that persists across the entire conversation. Add periodic "reflection" steps: "Check: am I still working toward the original goal?"
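Both parts of that fix fit in a few lines. The sketch below assumes a hypothetical build_messages helper that is called before every model request; the prompt wording and the reflect_every interval are illustrative choices, not tuned values.

```python
def build_messages(task, history, step, reflect_every=5):
    """Pin the task in a system message and add a periodic reflection prompt.

    Hypothetical helper; the prompt wording and the interval are assumptions.
    """
    messages = [{"role": "system",
                 "content": f"Original task (keep this in mind): {task}"}]
    messages.extend(history)
    # Every reflect_every steps, ask the model to re-check its direction
    if step > 0 and step % reflect_every == 0:
        messages.append({
            "role": "user",
            "content": ("Check: am I still working toward the original goal? "
                        "If not, state how to get back on track."),
        })
    return messages


history = [{"role": "assistant", "content": "Debugging an import error..."}]
for step in (4, 5):
    msgs = build_messages("Analyze the sales data for Q3 2024", history, step)
    print(f"step {step}: {len(msgs)} messages, last role = {msgs[-1]['role']}")
```

Because the system message is rebuilt on every call instead of living in the message history, it cannot be summarized away or truncated when older messages get compressed.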
Overconfidence: the agent reports "Task complete!" when it hasn't actually finished. It says "The file has been updated" but the write_file call returned an error that it didn't read. Fix: include explicit verification steps in the plan -- after writing a file, read it back and confirm the content. After running code, check the return code. Trust but verify.
def verify_completion(agent_claim, verification_steps):
    """Verify an agent's claim of task completion."""
    print(f"Agent claims: {agent_claim}")
    print("Running verification:")
    all_passed = True
    for step_name, check_fn in verification_steps:
        result = check_fn()
        status = "PASS" if result else "FAIL"
        if not result:
            all_passed = False
        print(f" [{status}] {step_name}")
    verdict = "CONFIRMED" if all_passed else "FAILED"
    print(f"Verification: {verdict}")
    return all_passed


verify_completion(
    "I've written the analysis report to output.txt",
    [
        ("File exists", lambda: True),
        ("File not empty", lambda: True),
        ("Contains expected sections", lambda: True),
        ("Numbers match source data", lambda: False),  # oops!
    ]
)
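The lambdas above are hard-coded stand-ins. As a sketch of what real checks might look like, here are two small functions for the cases mentioned earlier: reading a file back after writing it, and checking the return code after running code. The function names, the file content and the section names are all made up for the demo.

```python
import os
import subprocess
import sys
import tempfile


def file_contains(path, required_sections):
    """Read the file back and confirm every expected section is present."""
    if not os.path.isfile(path):
        return False
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return all(section in text for section in required_sections)


def code_runs_cleanly(source):
    """Run a Python snippet in a subprocess and check its return code."""
    result = subprocess.run([sys.executable, "-c", source],
                            capture_output=True, text=True, timeout=10)
    return result.returncode == 0


with tempfile.TemporaryDirectory() as tmp:
    report_path = os.path.join(tmp, "output.txt")
    # Simulate the agent's write_file call (content is made up for the demo)
    with open(report_path, "w", encoding="utf-8") as f:
        f.write("Summary\nQ3 revenue by region...\n")

    checks = [
        ("File has Summary section", file_contains(report_path, ["Summary"])),
        ("File has Growth section", file_contains(report_path, ["Growth"])),
        ("Good snippet exits with 0", code_runs_cleanly("print(1 + 1)")),
        ("Bad snippet exits with 0", code_runs_cleanly("raise SystemExit(2)")),
    ]

for name, passed in checks:
    print(f"  [{'PASS' if passed else 'FAIL'}] {name}")
```

Each of these can be wrapped in a lambda and passed as a check_fn to verify_completion, so the verdict depends on what is actually on disk and not on what the agent says it did.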
Putting it in perspective
We've covered the fundamental building blocks of AI agents: the observe-think-act loop, tool use, ReAct reasoning, planning strategies, and memory management. Every agent framework out there (LangChain, CrewAI, AutoGen, you name it) is built on exactly these primitives. The frameworks add convenience and boilerplate, but the core concepts are what we built today from scratch.
Having said that, what we built is the foundation. Real-world agents need quite a few more capabilities: multi-agent collaboration (multiple agents working together on different parts of a task), human-in-the-loop patterns (knowing when to ask for help instead of guessing), error recovery strategies beyond simple retries, cost management (each tool call costs tokens -- an agent that fires off 50 tool calls gets expensive fast), and evaluation frameworks to measure whether your agent actually works. Those are the topics for part 2.
The bottom line
- An AI agent loops: observe, think, act, observe -- until the task is complete or the maximum number of steps is reached;
- Tools give agents capabilities (search, code execution, file access) but must be carefully designed: clear descriptions, typed parameters, structured output, predictable errors;
- ReAct (Thought/Action/Observation) interleaves explicit reasoning with tool use, improving reliability and making agent behavior interpretable;
- Planning decomposes complex tasks before execution -- plan-then-execute for predictable tasks, iterative replanning for messy real-world ones;
- Memory management prevents context overflow: summarize old messages, use a scratchpad for key facts, consider vector memory for long sessions;
- Common failure modes are predictable: infinite loops (track and break), hallucinated tools (validate before executing), task drift (keep the goal visible), overconfidence (verify claims);
- Security matters: sandbox tool execution, validate arguments, set timeouts. A confused agent with unrestricted access is a liability.
Exercises
Exercise 1: Build a tool registry with validation. Create a ToolRegistry class that stores tool definitions (name, description, parameter schema with types). Implement register(name, func, description, params) to add tools and validate_call(name, args) that checks: (a) tool exists, (b) all required parameters are present, (c) parameter types match the schema (string, int, float, bool, list). Test with 4 different tools and at least 8 calls -- 4 valid and 4 invalid (wrong tool name, missing param, wrong type, extra params). Print a validation report showing which calls passed and which were rejected and why.
Exercise 2: Implement a loop detector with escalating responses. Build a LoopDetector class that tracks the last N tool calls (default N=10). When it detects a repeated call pattern (same tool + same args appearing 3+ times), it should: first time, return a warning message suggesting a different approach; second time, force a specific "reflect" action (inject a message asking the agent to reconsider); third time, terminate with an error. Simulate an agent making 20 tool calls where calls 5-8 are identical and calls 14-17 are identical. Print the detector's response at each step.
Exercise 3: Build an agent memory system with vector retrieval. Create a class that stores every tool call and result as an embedded vector (using sentence_transformers). When the agent needs to recall previous work, it queries the memory with a natural language question and retrieves the top-3 most relevant past interactions by cosine similarity. Test it: add 15 diverse tool results (file reads, calculations, web searches) to memory, then query with 5 questions like "What did I find out about sorting algorithms?" and "What files did I read?". Print the retrieved memories for each query with similarity scores. Verify that relevant memories rank higher than irrelevant ones.