Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
What will I learn
- You will learn why "eyeballing" patterns doesn't scale beyond simple problems;
- what a loss function is and why it's the most important concept in ML;
- Mean Squared Error explained step by step -- no shortcuts;
- what "minimizing the loss" actually means, with a concrete analogy;
- what parameters are -- the knobs a model can turn;
- how gradient descent works at a conceptual level, before the formulas arrive.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas (this post)
Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
We've been doing pretty well without math so far, haven't we? Over the last four episodes we built predictions, measured errors, found patterns, and compared models -- all with intuition and some NumPy operations. In episode #4 we built a full K-Nearest Neighbors predictor from scratch. In episode #5 we tried different lines and compared their errors. No calculus. No derivatives. No Greek letters. Just code and thinking.
So why can't we just keep going like this?
Because intuition breaks. It breaks hard, and it breaks precisely at the moment you need it most. When you have 50 features instead of 2. When you have a million data points instead of 10. When the pattern lives in a space you can't visualize -- and trust me, 50-dimensional space is NOT something your brain handles gracefully (mine certainly doesn't). At some point, you need a systematic way to answer one question: "what are the best parameters for my model, given this data?"
That systematic way is math. But I'm not going to dump equations on you. Instead, I want to show you why each piece of math exists -- what problem it solves that intuition can't. By the time we actually hit the formulas in later episodes, they won't be abstract. You'll already know what they're for ;-)
Let's go.
Why eyeballing doesn't scale
In episode #5, we tried a few different lines and compared their errors. That worked because we had one input feature (square meters) and one output (price). A line has two parameters: slope and intercept. We could try a handful of combinations, compute the error for each, and pick the best one. Easy.
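To make that concrete, here's a minimal sketch of that brute-force approach for two parameters -- a coarse grid of candidate slopes and intercepts, using the same toy apartment data we'll reuse later in this post:
import numpy as np
# Toy data: apartment sizes (sqm) and sale prices (EUR)
sqm = np.array([40, 60, 80, 100, 120], dtype=np.float64)
price = np.array([105000, 155000, 205000, 260000, 310000], dtype=np.float64)
best = None
for slope in range(1000, 4001, 500): # a handful of candidate slopes
    for intercept in range(0, 50001, 10000): # and a handful of candidate intercepts
        preds = slope * sqm + intercept
        error = ((price - preds) ** 2).mean() # average squared miss (formalized below as MSE)
        if best is None or error < best[0]:
            best = (error, slope, intercept)
print(f"Best combination on this grid: slope={best[1]}, intercept={best[2]}, error={best[0]:,.0f}")
Forty-two combinations, trivially cheap. Keep that approach in mind -- it's about to fall apart.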
Now imagine you have 50 features. Your model has 51 parameters (one weight per feature plus a bias term). How many combinations do you need to try?
import numpy as np
# With 2 parameters, trying 100 values each = 10,000 combinations
# With 51 parameters, trying 100 values each = 100^51 combinations
# That's 10^102 -- more than the number of atoms in the universe
n_features = 50
n_params = n_features + 1 # weights + bias
values_per_param = 100
print(f"Parameters to tune: {n_params}")
print(f"Values to try per parameter: {values_per_param}")
print(f"Total combinations: {values_per_param}^{n_params} = 10^{n_params * 2}")
print(f"Atoms in observable universe: ~10^80")
print(f"\nBrute force is not an option.")
That's 10^102 combinations. The observable universe has roughly 10^80 atoms. You'd need more computers than there are atoms to try every combination -- and that's with only 100 candidate values per parameter. In practice you'd want thousands of values per parameter to get any reasonable resolution. The combinatorial explosion is total and absolute.
This is why we need math. Not because math is elegant (although it is), but because brute force is physically impossible for real-world problems. We need a smarter strategy. And that strategy starts with asking a precise question: "how wrong is my model, exactly?"
The loss function: turning "how wrong" into a number
Remember how in episode #4 we casually computed errors -- the difference between predicted and actual values? We used MAE and RMSE to measure how bad our predictions were. The loss function formalizes this into a single number that summarizes how wrong the model is across ALL data points.
I'm going to say something strong here: the loss function is the most important concept in all of machine learning. (I know I said that about the training loop in episode #1, and about overfitting too, but hear me out -- they're all connected.) The loss function defines what "good" means. Every ML algorithm -- from the linear regression we'll build soon to GPT with its hundreds of billions of parameters -- is fundamentally solving the same problem: find the parameters that make the loss function as small as possible.
Let's build up to the most common loss function, step by step. No shortcuts.
Step 1: the error for one prediction
You predicted EUR 220,000 for an apartment that actually sold for EUR 185,000. How wrong are you?
actual = 185000
predicted = 220000
error = actual - predicted
print(f"Error: {error:+,}") # -35,000
The error is -35,000. The negative sign means we predicted too high. Simple enough. But for measuring overall model quality, we have a problem -- positive and negative errors cancel each other out. A prediction that's EUR 50,000 too high and another that's EUR 50,000 too low average to zero error. That doesn't mean the model is perfect. It means our metric is lying to us.
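Here's that cancellation in two lines -- two made-up errors of the same size but opposite sign:
errors = np.array([50000.0, -50000.0]) # one prediction EUR 50,000 too low, one EUR 50,000 too high
print(f"Mean error: {errors.mean():,.0f}") # 0 -- looks perfect, clearly isn't
print(f"Mean absolute error: {np.abs(errors).mean():,.0f}") # 50,000 -- the honest picture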
Step 2: squared error -- making all errors positive
We need all errors to be positive. We could use absolute values (and that gives us MAE, which we saw in episode #4). But squaring has nicer mathematical properties -- it's differentiable everywhere (important for gradient descent, coming up soon) and it penalizes large errors more heavily than small ones. An error of EUR 100,000 doesn't just count twice as much as EUR 50,000 -- it counts four times as much. This is actually a desirable property in most cases.
squared_error = error ** 2
print(f"Squared error: {squared_error:,}") # 1,225,000,000
A big number, sure, but that's fine -- we care about relative comparisons between models, not the absolute magnitude.
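To see the "four times as much" claim in plain numbers, compare how absolute and squared errors weigh a EUR 100,000 miss against a EUR 50,000 one:
small_miss, big_miss = 50000, 100000
print(f"Absolute: the big miss counts {big_miss / small_miss:.0f}x as much") # 2x
print(f"Squared:  the big miss counts {big_miss**2 / small_miss**2:.0f}x as much") # 4x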
Step 3: average across all predictions
We want one number for the whole model, not one number per prediction. So we take the average of all the squared errors:
# Our model's predictions for 5 apartments
actuals = np.array([185000, 210000, 145000, 320000, 165000], dtype=np.float64)
predictions = np.array([200000, 195000, 160000, 300000, 175000], dtype=np.float64)
errors = actuals - predictions
squared_errors = errors ** 2
mse = squared_errors.mean()
print(f"Individual errors: {errors}")
print(f"Squared errors: {squared_errors}")
print(f"MSE (loss): {mse:,.0f}")
That final number -- the Mean Squared Error (MSE) -- is our loss. Lower MSE = better model. If we change our predictions and the MSE goes down, we improved. If it goes up, we got worse. Simple, unambiguous, mathematical. No more "well, this line looks like it fits better" -- now we have a number. And numbers don't lie (well, they can be misleading, but that's a topic for when we cover evaluation properly).
Having said that, MSE is not the only loss function. We already met MAE in episode #4, and there are others -- Huber loss, log loss for classification, cross-entropy for probability outputs. Different problems call for different loss functions. But MSE is the default starting point, and understanding it deeply gives you the foundation for all the rest.
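Just to show MSE isn't the only option in code either, here's MAE (and RMSE, which puts the squared loss back into euros) on the same five apartments -- reusing the actuals and predictions arrays from the block above:
mae = np.abs(actuals - predictions).mean()
rmse = np.sqrt(((actuals - predictions) ** 2).mean())
print(f"MAE:  {mae:,.0f}") # average absolute miss, in euros
print(f"RMSE: {rmse:,.0f}") # square root of MSE, also in euros
print(f"MSE:  {mse:,.0f}") # the loss we'll be minimizing
Different numbers, same rule: lower means better.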
What "minimizing the loss" means
This is where the mental model gets really powerful. Imagine the loss function as a landscape -- hills and valleys. Each point on this landscape represents a specific set of parameters (slope value, intercept value, all the weights). The height at that point is the loss -- how wrong the model is with those particular parameter values.
"Minimizing the loss" means finding the lowest valley in this landscape. The bottom of the deepest valley is where the model performs best.
Remember the error landscape from episode #4? We swept through possible predictions from EUR 100,000 to EUR 350,000 and found that the mean (EUR 205,000) sat at the bottom of a smooth parabola. That was a 1D landscape -- one parameter, one valley. Now we're extending that to higher dimensions.
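Here's a compact re-creation of that episode #4 sweep, using the actuals array from the MSE example above (its mean happens to be exactly EUR 205,000): we predict one constant price for every apartment and check where the MSE bottoms out.
candidates = np.arange(100000, 350001, 5000) # constant predictions from EUR 100,000 to EUR 350,000
sweep_losses = [((actuals - c) ** 2).mean() for c in candidates]
best_constant = candidates[np.argmin(sweep_losses)]
print(f"Constant prediction with the lowest MSE: EUR {best_constant:,}")
print(f"Mean of the actual prices:               EUR {actuals.mean():,.0f}")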
# Visualize for a simple case: one parameter (slope)
# We fix intercept at 0 and vary the slope
sqm = np.array([40, 60, 80, 100, 120], dtype=np.float64)
price = np.array([105000, 155000, 205000, 260000, 310000], dtype=np.float64)
slopes = np.arange(1000, 4000, 100)
losses = []
for s in slopes:
    preds = s * sqm
    loss = ((price - preds) ** 2).mean()
    losses.append(loss)
losses = np.array(losses)
best_idx = losses.argmin()
best_slope = slopes[best_idx]
best_loss = losses[best_idx]
print(f"Best slope: {best_slope}")
print(f"Loss at best slope: {best_loss:,.0f}")
print(f"\nLoss landscape (sampled):")
for i in range(0, len(slopes), 4): # sample every 4th slope so the best one (2600) shows up
    bar = "#" * int(50 * (1 - losses[i] / losses.max()))
    marker = " <-- best" if slopes[i] == best_slope else ""
    print(f" slope={slopes[i]:>4d} loss={losses[i]:>15,.0f} {bar}{marker}")
See that valley? The loss drops as the slope approaches the sweet spot, and rises again if you go past it. This grid search works for one parameter. But as we discussed -- with 51 parameters, the landscape has 51 dimensions, and you can NOT search it exhaustively. Even a coarse grid over that landscape has more points than there are atoms in the universe.
We need a way to navigate this landscape efficiently. And that's what gradient descent does.
Parameters: the knobs the model can turn
Before we get to gradient descent, let's be really precise about what a "parameter" is. This is a word that gets thrown around a lot, and I want to make sure we're on the same page (because later, when we talk about hyperparameters, the distinction matters).
Consider the simplest prediction model:
prediction = slope * feature + intercept
The slope and intercept are parameters. They're the adjustable parts -- the knobs the model turns to fit the data. The features are fixed (they come from your data). The parameters are what the model "learns."
# A model with 2 parameters
def model(sqm, slope, intercept):
    return slope * sqm + intercept
# Different parameters = different predictions
params_a = (2000, 20000) # slope=2000, intercept=20000
params_b = (2600, 5000) # slope=2600, intercept=5000
preds_a = model(sqm, *params_a)
preds_b = model(sqm, *params_b)
loss_a = ((price - preds_a) ** 2).mean()
loss_b = ((price - preds_b) ** 2).mean()
print(f"Params A (slope={params_a[0]}, intercept={params_a[1]}):")
print(f" Predictions: {preds_a}")
print(f" Loss: {loss_a:,.0f}\n")
print(f"Params B (slope={params_b[0]}, intercept={params_b[1]}):")
print(f" Predictions: {preds_b}")
print(f" Loss: {loss_b:,.0f}")
The model is its parameters. Two identical architectures with different parameter values make completely different predictions. Training = finding better parameter values. A trained model = a set of parameter values that produce low loss.
Now scale this up. A neural network with 1 billion parameters has 1 billion adjustable numbers. Training it means finding good values for all billion simultaneously. The exact same principle, at a scale that requires extremely clever optimization. (And when people talk about how much it costs to train something like GPT -- the electricity bills, the thousands of GPUs -- they're talking about the computational cost of searching through a billion-dimensional landscape for the lowest valley. Billion-dimensional. Makes 51 dimensions seem cute, doesn't it? ;-))
Gradient descent: the blindfolded mountain analogy
Alright, here comes the key idea. No calculus. Just a metaphor, and then code.
Imagine you're blindfolded on a hilly landscape, and you need to find the lowest point. You can't see the terrain. But you CAN feel the ground under your feet -- you can tell which direction slopes downward from where you're currently standing. So you take a step downhill. Then feel the slope again. Step downhill. Feel the slope. Repeat until the ground stops descending -- every direction either goes up or stays flat. You've found a valley.
That's gradient descent. The "slope you feel" is the gradient -- a mathematical measurement of which direction is downhill and how steep it is. The "step you take" is controlled by the learning rate -- a number that determines how big each step is.
Let's see it work. We'll start with a terrible guess for the slope and let gradient descent improve it:
# Gradient descent for a SINGLE parameter (slope only, intercept=0)
# The "gradient" tells us: should we increase or decrease the slope?
slope = 1000.0 # Start with a bad guess
learning_rate = 0.00005 # How big each step is (tuned to this data's scale)
print("Gradient descent in action:\n")
for step in range(10):
    # Current predictions and loss
    preds = slope * sqm
    loss = ((price - preds) ** 2).mean()
    # The gradient: which direction should slope change?
    # (Don't worry about this formula -- we'll derive it properly later)
    gradient = -2 * (sqm * (price - preds)).mean()
    # Take a step in the opposite direction of the gradient
    slope = slope - learning_rate * gradient
    print(f" Step {step}: slope={slope:>8.1f} loss={loss:>15,.0f} gradient={gradient:>12,.0f}")
print(f"\nFinal slope: {slope:.1f}")
Watch what happens. The slope starts at 1000 (way too low -- our data needs something around 2600). Each step, the gradient tells us "increase the slope" and the model nudges it upward. The loss drops every step. The model is learning, right in front of your eyes.
Don't worry about the gradient formula -- I put it there so the code runs, but we'll derive it properly when we cover calculus. What matters right now is the process:
- Start with random (bad) parameters
- Compute the loss (how wrong you are)
- Compute the gradient (which direction improves things)
- Adjust parameters in that direction
- Repeat until the loss stops improving
This is the training loop from episode #1, made concrete and specific. And it works. Not just for our little slope-and-intercept model -- for any model where you can compute a gradient. Neural networks with billions of parameters use this exact same loop. The only differences are the model architecture (more complex), the loss function (problem-specific), and the optimizer (fancier versions of gradient descent). But the skeleton? Identical.
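To drive that home, here's a minimal sketch of that skeleton as generic code. The train function and the loss_fn/grad_fn stand-ins are just illustrative names for this post -- they reuse the sqm and price arrays defined above:
def train(params, grad_fn, loss_fn, learning_rate=0.00005, steps=100):
    # The generic skeleton: compute gradients, step downhill, repeat
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params, loss_fn(params)
def loss_fn(params): # MSE for our slope-only model
    return ((price - params[0] * sqm) ** 2).mean()
def grad_fn(params): # gradient of that loss with respect to the slope
    return [-2 * (sqm * (price - params[0] * sqm)).mean()]
learned, final_loss = train([1000.0], grad_fn, loss_fn)
print(f"Learned slope: {learned[0]:.1f} final loss: {final_loss:,.0f}")
Swap in a different loss_fn and grad_fn and the exact same train function optimizes a different model -- that's the sense in which the skeleton is identical.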
The learning rate: why step size matters
The learning rate is deceptively important. It's a single number, but it controls whether training succeeds or fails catastrophically.
Too large? You overshoot the valley -- bouncing back and forth across it like a ping-pong ball, never settling. The loss might actually increase each step, or oscillate wildly, or explode to infinity.
Too small? You crawl toward the valley so slowly it takes forever to get anywhere. In theory you'll converge eventually. In practice, you'll burn through your compute budget before the model learns anything useful.
# Same problem, different learning rates
for lr_name, lr in [("too small", 1e-8), ("just right", 5e-5), ("too large", 1e-2)]:
    s = 1000.0
    print(f"\nLearning rate: {lr_name} ({lr})")
    for step in range(5):
        preds = s * sqm
        loss = ((price - preds) ** 2).mean()
        grad = -2 * (sqm * (price - preds)).mean()
        s = s - lr * grad
        if abs(s) > 1e10:
            print(f" Step {step}: DIVERGED! slope={s:.0e}")
            break
        print(f" Step {step}: slope={s:>10.1f} loss={loss:>15,.0f}")
Run it and compare. With the "too small" learning rate, the slope barely moves. With "just right," it converges nicely. With "too large," it blows up -- the loss explodes, the slope shoots to infinity, the model is completely useless.
Finding the right learning rate is a practical skill that every ML practitioner develops. Most people start with something like 0.001 or 0.01 and adjust from there. There are also adaptive techniques (learning rate scheduling, optimizers like Adam that automatically adjust the step size per-parameter) that we'll cover later. For now, just know that this one number can make or break your training. Yes, really. I've seen production models fail because someone didn't tune the learning rate. It's that important.
And here's something to sit with: remember when we talked about the hyperparameter K in KNN back in episode #4? K was a setting you chose before the model ran. The learning rate is the same kind of thing -- a hyperparameter. The model doesn't learn it from data. You set it, and if you set it wrong, nothing else matters.
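A tiny, purely illustrative way to picture that split (the names here are made up for this example):
# Hyperparameters: chosen by you BEFORE training -- never learned from the data
hyperparameters = {"learning_rate": 0.00005, "steps": 10, "k_neighbors": 3}
# Parameters: start as rough guesses, end up wherever training drives them
parameters = {"slope": 1000.0, "intercept": 50000.0}
print(f"You set: {hyperparameters}")
print(f"The model learns: {parameters} (these values change during training)")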
A complete walk-through: two parameters at once
Let's level up. Instead of just a slope, let's optimize both slope AND intercept simultaneously. This is what real training looks like -- multiple parameters being adjusted together at every step:
# Gradient descent with TWO parameters: slope and intercept
slope = 1000.0
intercept = 50000.0
learning_rate = 0.00005 # same step size as the single-parameter example
print("Training with two parameters:\n")
for step in range(15):
    preds = slope * sqm + intercept
    loss = ((price - preds) ** 2).mean()
    # Gradient for each parameter
    grad_slope = -2 * (sqm * (price - preds)).mean()
    grad_intercept = -2 * (price - preds).mean()
    # Update both simultaneously
    slope = slope - learning_rate * grad_slope
    intercept = intercept - learning_rate * grad_intercept
    if step % 3 == 0 or step == 14:
        print(f" Step {step:>2d}: slope={slope:>8.1f} intercept={intercept:>9.1f} loss={loss:>15,.0f}")
print(f"\nFinal model: price = {slope:.0f} * sqm + {intercept:.0f}")
# How good is this model?
final_preds = slope * sqm + intercept
print(f"\nPredictions vs actual:")
for s, pred, act in zip(sqm, final_preds, price):
    print(f" {s:>3.0f} sqm: predicted EUR {pred:>9,.0f} | actual EUR {act:>9,.0f}")
Both parameters move at every step, each nudged in the direction that reduces the loss. The slope does most of the work here -- its gradient is far larger, so it changes quickly, while the intercept shifts only slightly each step -- and the loss drops steadily. This is a two-dimensional version of our blindfolded hiker -- now they're navigating a 2D landscape instead of a 1D line, but the principle is identical. Feel the slope in every direction, step downhill.
Putting it all together
Let me give you the full picture of where we are after six episodes:
Episodes #1-5 gave us intuition:
- ML finds patterns in data (episode #1)
- Data is numbers organized as features and targets (episode #3)
- We can predict by finding similar examples or fitting lines (episode #4)
- Some features are useful, some are noise (episode #5)
- The error landscape has a minimum, and we want to find it (episode #4)
Episode #6 (this one) bridges to math:
- Intuition doesn't scale -- 51 parameters means 10^102 combinations. Brute force is physically impossible.
- The loss function turns "how wrong" into a single number (MSE). This is what you minimize.
- Parameters are the adjustable knobs. The model IS its parameters. Training = finding good values.
- Gradient descent finds the best parameters by walking downhill -- no need to search the entire landscape.
- The learning rate controls step size: too big = overshoot and diverge, too small = crawl forever.
Next episode, we'll put this all together into a complete working training loop -- watching a model learn step by step, loss going down, predictions getting better, the whole thing running from start to finish. Then after that, we'll cover the linear algebra and calculus that make gradient descent rigorous. By then the math won't feel abstract -- you'll already know exactly what problem each formula is solving, because you've seen it work in code first.
What have you learned?
- Brute-force parameter search fails beyond one or two parameters -- 50 features means 10^102 combinations, more than atoms in the universe;
- The loss function (MSE) turns model quality into a single number: lower = better;
- MSE works by squaring errors (making them all positive) and averaging -- it penalizes large errors disproportionately, which is usually what you want;
- Parameters are the model's adjustable values -- training means finding the values that minimize the loss;
- Gradient descent is "feel which way is downhill, take a step, repeat" -- it navigates the loss landscape without searching every point;
- The learning rate controls step size -- too large overshoots and diverges, too small crawls uselessly;
- This is the formalization of the intuition we've been building since episode #1: measure error, adjust, repeat. Just now it's precise, scalable, and mathematical.