Learn AI Series (#12) - Classification - Logistic Regression From Scratch
What will I learn
- You will learn the difference between regression and classification -- predicting categories, not numbers;
- the sigmoid function and why it squishes any number into the 0-1 range;
- why MSE doesn't work for classification and what binary cross-entropy does instead;
- how to derive and implement the gradient for logistic regression;
- how decision boundaries separate classes in feature space;
- how to implement logistic regression from scratch in pure NumPy -- same philosophy as episode #10;
- how threshold choice creates a tradeoff between different types of errors;
- the complete from-scratch model packaged as a reusable class.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch (this post)
Learn AI Series (#12) - Classification - Logistic Regression From Scratch
Eleven episodes of predicting numbers. Apartment prices. Salaries. Continuous values on a scale where the answer can be 47,382.19 or 301,008.55 or anything in between. Linear regression (#10), polynomial regression, regularization, the normal equation (#11) -- all of it aimed at one thing: given some features, predict a number.
But what if the thing you're trying to predict isn't a number at all?
Is this email spam or not? Is this tumor malignant or benign? Will this customer leave or stay? Did this transaction come from the cardholder or from a thief? These are classification problems -- the answer is a category, not a number. And they're everywhere. I'd argue that in practice, classification is more common than regression. Most real business questions boil down to "which bucket does this thing belong in?" not "what exact value does this thing have?"
Today we build our first classification model from absolute scratch. Same philosophy as episode #10 -- no scikit-learn, no PyTorch, just NumPy and understanding. By the end of this episode you'll have a working logistic regression classifier that you built yourself, and you'll understand every line of math behind it. And here's the beautiful part: once you understand logistic regression, you're 80% of the way to understanding neural networks. The sigmoid activation, the cross-entropy loss, the gradient computation -- all of it carries straight over. This episode is a gateway ;-)
Let's go.
Why not just use linear regression for classification?
Your first instinct might be perfectly reasonable: "Can't I just predict 0 for 'not spam' and 1 for 'spam' using the linear regression model we already built?" It's a natural thought. The label IS a number (0 or 1), and linear regression predicts numbers. Let's try it and see what goes wrong:
import numpy as np
np.random.seed(42)
# Simple classification problem: predict spam (1) or not spam (0)
# Feature: number of exclamation marks in the email
exclamation_marks = np.array([0, 1, 1, 2, 3, 3, 4, 5, 7, 8, 10, 12, 15])
is_spam = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
# Try linear regression -- same approach as episode #10
X = np.column_stack([exclamation_marks, np.ones(len(exclamation_marks))])
w = np.linalg.lstsq(X, is_spam, rcond=None)[0]
predictions = X @ w
print("Linear regression for classification:\n")
for em, actual, pred in zip(exclamation_marks, is_spam, predictions):
label = "spam" if pred > 0.5 else "not spam"
print(f" {em:>2d} marks -> predicted {pred:>5.2f} -> {label:>8s} "
f"(actual: {'spam' if actual else 'not spam'})")
Run it and look at the numbers. Two problems jump out immediately.
Problem 1: predictions go outside [0, 1]. Some predictions are negative, some are above 1. What does a "probability of -0.15" even mean? Or a "probability of 1.3"? Nothing. It's nonsensical. A probability must be between 0 and 1 -- that's not my opinion, that's the mathematical definition of a probability. Linear regression doesn't know about this constraint. It happily produces any number on the real line.
Problem 2: outlier sensitivity. Add one email with 100 exclamation marks (an extreme outlier) and watch the entire line shift. The decision boundary moves, and emails that were correctly classified before suddenly get misclassified. The linear model is trying to minimize squared error across ALL points, so one extreme value drags the whole line toward it. That's fine for regression (you want the line close to all points), but terrible for classification where you care about getting the boundary RIGHT, not about minimizing distance to every point equally.
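You can see problem 2 for yourself with a small experiment. This is just an illustrative sketch reusing the variables from the snippet above: append one extreme email (labelled spam here, consistent with the pattern), refit the line, and compare predictions for the original emails before and after.
# Demonstrate outlier sensitivity: add one extreme email and refit the linear model
em_outlier = np.append(exclamation_marks, 100)   # one email with 100 exclamation marks...
spam_outlier = np.append(is_spam, 1)             # ...labelled spam, consistent with the pattern
X_outlier = np.column_stack([em_outlier, np.ones(len(em_outlier))])
w_outlier = np.linalg.lstsq(X_outlier, spam_outlier, rcond=None)[0]
preds_after = X @ w_outlier                      # predictions for the ORIGINAL 13 emails
print("Effect of a single outlier on the linear fit:\n")
for em, before, after in zip(exclamation_marks, predictions, preds_after):
    flipped = " <-- label flipped" if (before > 0.5) != (after > 0.5) else ""
    print(f" {em:>2d} marks: {before:>+.2f} -> {after:>+.2f}{flipped}")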
We need something that takes any number and squishes it into the 0-1 range, no matter what. Something that behaves like a probability. That something is the sigmoid function.
The sigmoid: squishing everything into (0, 1)
The sigmoid function takes any real number and maps it to a value strictly between 0 and 1:
sigmoid(z) = 1 / (1 + e^(-z))
That's it. One formula. Let me show you what it does to different inputs:
def sigmoid(z):
return 1 / (1 + np.exp(-z))
# See what sigmoid does to different values
z_values = np.array([-10, -5, -2, -1, 0, 1, 2, 5, 10])
print("Sigmoid function demo:\n")
for z in z_values:
print(f" sigmoid({z:>3d}) = {sigmoid(z):.6f}")
Look at the pattern:
- Large negative input (like -10) produces output very close to 0. The model is saying "almost certainly NOT this class."
- Large positive input (like +10) produces output very close to 1. "Almost certainly IS this class."
- Input of 0 produces exactly 0.5. Maximum uncertainty -- the model has no clue.
- Everything in between smoothly transitions from 0 to 1.
And crucially: the sigmoid is differentiable everywhere. No discontinuities, no sharp corners. That means gradient descent works (remember from episode #9 -- we need differentiable functions to compute gradients). This is not a coincidence. The sigmoid was chosen precisely because it has the right output range AND plays nicely with calculus.
So here's the key idea. Logistic regression is just linear regression with a sigmoid wrapped around it:
y_hat = sigmoid(X @ w + b) = sigmoid(z)
We compute the same weighted sum of features as in episode #10 (z = X @ w + b), but then we pass it through the sigmoid. The output is now a probability between 0 and 1. "This email has a 93% probability of being spam." "This transaction has a 12% probability of being fraudulent." Proper probabilities, not random numbers on the real line.
Yes, the name is confusing: "logistic regression" is a classification algorithm. The name is historical. The "regression" part refers to the fact that we're still computing a linear combination of features internally; the classification happens because we put a sigmoid on top and then threshold the result.
# Linear regression vs logistic regression on the same data
z_linear = X @ w # raw linear output (from our earlier fit)
# For logistic: recompute with sigmoid
logistic_preds = sigmoid(z_linear)
print("\nLinear vs logistic predictions:\n")
print(f"{'Marks':>5s} {'Linear':>8s} {'Logistic':>8s} {'Actual':>6s}")
for em, lin, log, act in zip(exclamation_marks, predictions, logistic_preds, is_spam):
print(f" {em:>3d} {lin:>+8.3f} {log:>8.3f} {act:>4d}")
Notice how the logistic predictions are all between 0 and 1 now. But they still use the weights we learned for linear regression, and those weights aren't optimal for classification. We need a loss function designed for classification to learn proper weights. That's next.
Binary cross-entropy: the right loss for classification
Back in episode #6 we introduced MSE (Mean Squared Error) as the loss function for regression. It measures the average squared difference between predictions and targets. For regression -- predicting numbers -- MSE works great. For classification? Not so much.
Here's why. When the true label is 1 (spam) and you predict 0.99 (almost certainly spam), MSE says the error is (1 - 0.99)^2 = 0.0001. Tiny. Good. When you predict 0.01 (almost certainly NOT spam -- you're terribly wrong), MSE says the error is (1 - 0.01)^2 = 0.98. That seems like a big penalty, but let's compare it to what binary cross-entropy gives us:
def binary_cross_entropy(y_true, y_pred):
"""BCE loss -- the correct loss for binary classification."""
eps = 1e-15 # tiny value to avoid log(0)
y_pred = np.clip(y_pred, eps, 1 - eps)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Compare MSE vs BCE for different predictions when true label is 1
y_true = 1.0
test_preds = [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
print(f"True label: {y_true}")
print(f"\n{'Predicted':>10s} {'MSE':>8s} {'BCE':>8s}")
for p in test_preds:
mse = (y_true - p) ** 2
bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(f" {p:>8.2f} {mse:>8.4f} {bce:>8.4f}")
Look at the BCE column when you predict 0.01 and the truth is 1. The penalty is around 4.6 -- compared to MSE's 0.98. BCE is MUCH harsher on confident wrong predictions. And that's exactly what we want. A classifier that confidently says "definitely not spam" when it IS spam should be punished severely. A little uncertainty is forgivable. Confident wrongness is not.
The formula:
J(w) = -(1/m) * sum[yi * log(y_hat_i) + (1 - yi) * log(1 - y_hat_i)]
Let me break this down because the formula looks intimidating but the logic is simple:
- When yi = 1 (positive class), the loss for that sample is -log(y_hat_i). If the model predicts y_hat close to 1, -log(1) = 0 -- no penalty. If it predicts close to 0, -log(0.001) = 6.9 -- massive penalty. The closer to zero you predict when the truth is one, the worse it gets. And it gets worse logarithmically fast.
- When yi = 0 (negative class), the loss is -log(1 - y_hat_i). Same logic in reverse. Predicting close to 0 is good (no penalty), predicting close to 1 is terrible.
- Average across all m samples and you have the total cost.
The np.clip in our implementation prevents log(0), which would give negative infinity. Clipping predictions to a tiny epsilon above 0 and below 1 handles this edge case gracefully.
Deriving the gradient
Here's where the chain rule from episode #9 comes back. We need the derivative of BCE loss with respect to the weights, so gradient descent knows which direction to push each weight.
The derivation involves three chained functions: the loss (BCE), the sigmoid, and the linear combination. Work through the chain rule and -- I'm not going to lie -- the algebra takes a few pages. But the final result is remarkably clean:
dJ/dw = (1/m) * X_transpose @ (y_hat - y)
If this looks familiar, it should. In episode #10, our linear regression gradient was -(2/m) * X_transpose @ (y - Xw). The logistic gradient has the same structure -- it's X.T @ errors -- but with two differences: the predictions y_hat now pass through the sigmoid, and the sign convention flips (we use y_hat - y instead of y - y_hat, which absorbs the negative sign from the derivative of the log).
The reason this works out so cleanly is that the sigmoid's derivative has a beautiful property: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)). When you chain this through the log-loss derivative, terms cancel out and you get that simple formula. It's one of those moments in math where everything aligns perfectly -- and it's NOT a coincidence. Cross-entropy was specifically designed to pair well with the sigmoid. The whole thing was engineered to be elegant ;-)
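Don't take my word for that derivative identity. Here's a quick numerical check using the sigmoid we defined above -- a central difference on one side, the closed-form sigmoid(z) * (1 - sigmoid(z)) on the other:
# Verify sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) numerically
h = 1e-6
for z in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
    closed_form = sigmoid(z) * (1 - sigmoid(z))
    print(f" z = {z:>+4.1f}: numerical = {numerical:.6f} closed form = {closed_form:.6f}")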
def logistic_gradient(X, y, w):
"""Gradient of BCE loss with respect to weights."""
m = len(y)
z = X @ w
y_hat = sigmoid(z)
return (1/m) * X.T @ (y_hat - y)
Let's verify it numerically, the same way we did in episode #10. Nudge each weight by epsilon in both directions and compare:
# Set up a small test problem
np.random.seed(42)
X_test_grad = np.column_stack([np.random.randn(20, 2), np.ones(20)])
y_test_grad = (np.random.randn(20) > 0).astype(float)
w_test = np.array([0.5, -0.3, 0.1])
# Analytical gradient
analytical = logistic_gradient(X_test_grad, y_test_grad, w_test)
# Numerical gradient (central difference -- episode #9)
def bce_cost(X, y, w):
y_hat = sigmoid(X @ w)
eps = 1e-15
y_hat = np.clip(y_hat, eps, 1 - eps)
return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
epsilon = 1e-5
numerical = np.zeros_like(w_test)
for i in range(len(w_test)):
w_plus = w_test.copy()
w_plus[i] += epsilon
w_minus = w_test.copy()
w_minus[i] -= epsilon
numerical[i] = (bce_cost(X_test_grad, y_test_grad, w_plus) -
bce_cost(X_test_grad, y_test_grad, w_minus)) / (2 * epsilon)
print("Gradient verification:\n")
for i, name in enumerate(["w[0]", "w[1]", "bias"]):
diff = abs(analytical[i] - numerical[i])
print(f" {name}: analytical={analytical[i]:>+.6f} "
f"numerical={numerical[i]:>+.6f} diff={diff:.2e}")
If the differences are on the order of 1e-8 or smaller, our gradient formula is correct. This gradient checking technique from episode #10 is not optional -- it's your safety net. I've seen gradient bugs that produce plausible-looking training curves but converge to the wrong solution. Always verify analytically derived gradients with numerical checks before trusting them.
The complete implementation
Time to put it all together. Same structure as our LinearRegression class from episode #10 -- a class with .fit(), .predict_proba(), and .predict() methods. The training loop follows the exact same four-step pattern from episode #7: forward pass, compute loss, compute gradient, update weights.
class LogisticRegression:
def __init__(self, learning_rate=0.1, n_epochs=1000):
self.lr = learning_rate
self.n_epochs = n_epochs
self.weights = None
self.history = []
def fit(self, X, y):
# Add bias column (the bias trick from episode #10)
X_b = np.column_stack([X, np.ones(len(X))])
m, n = X_b.shape
self.weights = np.zeros(n)
for epoch in range(self.n_epochs):
# Step 1: forward pass -- predict with sigmoid
z = X_b @ self.weights
y_hat = sigmoid(z)
# Step 2: compute cost (BCE)
eps = 1e-15
y_hat_safe = np.clip(y_hat, eps, 1 - eps)
cost = -np.mean(y * np.log(y_hat_safe) +
(1 - y) * np.log(1 - y_hat_safe))
self.history.append(cost)
# Step 3: compute gradient
gradient = (1/m) * X_b.T @ (y_hat - y)
# Step 4: update weights
self.weights -= self.lr * gradient
return self
def predict_proba(self, X):
"""Return probability of class 1."""
X_b = np.column_stack([X, np.ones(len(X))])
return sigmoid(X_b @ self.weights)
def predict(self, X, threshold=0.5):
"""Return class labels (0 or 1)."""
return (self.predict_proba(X) >= threshold).astype(int)
Compare this to the LinearRegression class from episode #10. The structure is identical: initialize weights to zero, loop for n_epochs, compute forward pass + cost + gradient + update. The differences are:
- The forward pass wraps X @ w in sigmoid() instead of using it raw
- The cost function is BCE instead of MSE
- The gradient formula reflects the new loss (but has the same X.T @ errors structure)
- We have predict_proba() (probabilities) AND predict() (class labels)
That's it. The entire architecture is the same. Swap out the loss function and the activation, and regression becomes classification. This is a pattern you'll see again and again -- especially when we get to neural networks, where each layer has its own activation function and the final layer determines whether the network does regression or classification. Same training loop. Different pieces inside it.
Training on a real problem
Let's generate a synthetic classification dataset -- one where we KNOW the true decision boundary -- and train our logistic regression model on it. Same approach as episodes #7 and #10: start with data where you know the answer, verify the model discovers it.
np.random.seed(42)
n = 200
# Two features: study hours and sleep hours before exam
# Label: pass (1) or fail (0)
study_hours = np.random.uniform(0, 10, n)
sleep_hours = np.random.uniform(3, 10, n)
# True rule: you pass if 0.6*study + 0.4*sleep > 5 (plus noise)
score = 0.6 * study_hours + 0.4 * sleep_hours + np.random.randn(n) * 0.8
passed = (score > 5).astype(float)
X = np.column_stack([study_hours, sleep_hours])
y = passed
print(f"Dataset: {n} students")
print(f" Passed: {int(y.sum())} ({y.mean()*100:.0f}%)")
print(f" Failed: {int(n - y.sum())} ({(1-y.mean())*100:.0f}%)")
# Train/test split -- just like episode #10
split = int(0.8 * n)
idx = np.random.permutation(n)
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]
# Feature scaling -- remember from episode #10, this is not optional
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std # use TRAIN statistics, not test
print(f"\nTrain: {len(y_train)} samples")
print(f"Test: {len(y_test)} samples")
Now train it:
model = LogisticRegression(learning_rate=0.5, n_epochs=500)
model.fit(X_train_s, y_train)
# Evaluate
train_preds = model.predict(X_train_s)
test_preds = model.predict(X_test_s)
train_acc = np.mean(train_preds == y_train)
test_acc = np.mean(test_preds == y_test)
print(f"\nResults after training:")
print(f" Training accuracy: {train_acc:.1%}")
print(f" Test accuracy: {test_acc:.1%}")
print(f" Final BCE loss: {model.history[-1]:.4f}")
# Show convergence
print(f"\nConvergence:")
checkpoints = [0, 10, 50, 100, 200, 499]
for i in checkpoints:
print(f" Epoch {i:>4d}: loss = {model.history[i]:.4f}")
There's that convergence curve again -- the same shape from episodes #7 and #10. Steep drop early, long flat tail. The loss is different (BCE instead of MSE), the model is different (sigmoid activation), but the training dynamics are the same. Predict, measure error, compute gradient, step downhill, repeat. The universal loop.
Decision boundaries: where the model draws the line
With regression, the model output is a continuous surface -- a hyperplane for linear regression, a curved surface for polynomial regression. With classification, the interesting thing is the decision boundary: the line (or curve, or hyperplane) where the model switches from predicting one class to the other.
For logistic regression, the decision boundary is where sigmoid(z) = 0.5, which happens when z = 0. So the boundary is where X @ w + b = 0 -- a straight line in 2D, a plane in 3D, a hyperplane in higher dimensions. Same as the linear regression hyperplane from episode #10, but now it separates classes instead of fitting through data points.
# Our model learned 3 weights: w[0] for study, w[1] for sleep, w[2] for bias
w = model.weights
print(f"Learned weights:")
print(f" study_hours: {w[0]:>+.3f}")
print(f" sleep_hours: {w[1]:>+.3f}")
print(f" bias: {w[2]:>+.3f}")
# The decision boundary is where w[0]*study + w[1]*sleep + w[2] = 0
# In scaled space. Let's see where this boundary falls
print(f"\nDecision boundary equation (scaled space):")
print(f" {w[0]:.3f} * study_scaled + {w[1]:.3f} * sleep_scaled "
f"+ {w[2]:.3f} = 0")
# At different study hours, what sleep hours do you need to pass?
if abs(w[1]) > 0.01:
print(f"\nBoundary in scaled coordinates:")
for study_s in [-2, -1, 0, 1, 2]:
sleep_s = -(w[0] * study_s + w[2]) / w[1]
# Convert back to original space
study_orig = study_s * std[0] + mean[0]
sleep_orig = sleep_s * std[1] + mean[1]
print(f" Study {study_orig:.1f}h -> need {sleep_orig:.1f}h sleep to be on the boundary")
This is the model saying: "if you study 2 hours, you need about X hours of sleep to have a 50/50 chance of passing. Study 8 hours and you only need Y hours." The boundary is a tradeoff curve -- more of one compensates for less of the other. The weights determine the exchange rate.
Inspecting individual predictions
Let me show you something that logistic regression gives you for free and that many other classifiers don't: calibrated probabilities. The model doesn't just say "pass" or "fail" -- it says "73% chance of passing" or "18% chance of passing." This is incredibly useful in practice because it tells you how confident the model is.
# Get probabilities for test set
test_probas = model.predict_proba(X_test_s)
# Show individual predictions sorted by confidence
sorted_idx = np.argsort(test_probas)
print("Sample predictions (sorted by confidence):\n")
print(f"{'Study':>6s} {'Sleep':>6s} {'P(pass)':>8s} {'Predicted':>10s} {'Actual':>8s}")
for i in sorted_idx[::4]: # show every 4th for readability
study_h = X_test[i, 0]
sleep_h = X_test[i, 1]
prob = test_probas[i]
pred = "pass" if prob >= 0.5 else "fail"
actual = "pass" if y_test[i] == 1 else "fail"
marker = " <-- WRONG" if pred != actual else ""
print(f" {study_h:>5.1f} {sleep_h:>5.1f} {prob:>8.3f} {pred:>10s} "
f"{actual:>6s}{marker}")
Notice the wrong predictions. They tend to happen near the decision boundary -- where the probability is close to 0.5. The model is uncertain there, which makes sense. Points far from the boundary (high study AND high sleep, or low study AND low sleep) get classified with high confidence. Points right on the boundary are a coin flip. The model is honest about its uncertainty, and that honesty is valuable.
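Here's one quick way to quantify that honesty (a small sketch reusing the test-set variables above): measure how far the predicted probabilities sit from 0.5, separately for correct and wrong predictions.
# Confidence = distance of the predicted probability from the 0.5 boundary
correct = (model.predict(X_test_s) == y_test)
confidence = np.abs(test_probas - 0.5)
print(f"Mean confidence on correct predictions: {confidence[correct].mean():.3f}")
if (~correct).any():
    print(f"Mean confidence on wrong predictions:   {confidence[~correct].mean():.3f}")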
Choosing the threshold
We've been using 0.5 as the threshold -- predict class 1 if P(class 1) >= 0.5. Seems natural. But 0.5 isn't always the right choice.
Think about medical diagnosis. If you're screening for cancer, what's worse: telling someone they MIGHT have cancer when they don't (false positive, leads to more tests), or telling someone they're fine when they actually have cancer (false negative, potentially fatal)? The cost of these two types of errors is wildly asymmetric. In this case, you'd lower the threshold to maybe 0.3 or 0.2 -- catching more actual cases at the cost of more false alarms.
# Sweep different thresholds and see how predictions change
print("Threshold analysis on test set:\n")
print(f"{'Threshold':>10s} {'Accuracy':>8s} {'Pred Pass':>10s} "
f"{'Pred Fail':>10s} {'FP':>4s} {'FN':>4s}")
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
preds = (test_probas >= threshold).astype(int)
acc = np.mean(preds == y_test)
# False positives: predicted pass but actually failed
fp = np.sum((preds == 1) & (y_test == 0))
# False negatives: predicted fail but actually passed
fn = np.sum((preds == 0) & (y_test == 1))
pred_pos = preds.sum()
pred_neg = len(preds) - pred_pos
print(f" {threshold:>8.1f} {acc:>8.1%} {pred_pos:>10d} "
f"{pred_neg:>10d} {fp:>4d} {fn:>4d}")
See the tradeoff? Lower thresholds catch more positives (fewer false negatives) but also create more false alarms (more false positives). Higher thresholds are more conservative -- fewer false alarms but more misses. The "best" threshold depends entirely on your problem's costs. For spam filtering, a false positive (legitimate email marked as spam) is annoying but recoverable. A false negative (spam getting through) is mildly irritating. Costs are roughly symmetric, so 0.5 is fine. For fraud detection, a false negative (missed fraud) costs money. A false positive (blocking a legitimate transaction) annoys a customer but prevents nothing. You'd lower the threshold. For cancer screening, you'd lower it even more.
This tradeoff between false positives and false negatives is one of the most important practical concepts in classification. We'll formalize it properly with precision, recall, confusion matrices, and ROC curves in an upcoming episode on evaluation. For now, just understand that the threshold is a knob you turn based on the relative cost of different mistakes.
A more challenging problem
Let's push our model with a dataset that has more features and a less obvious decision boundary:
# Simulate a medical screening problem
np.random.seed(42)
n = 300
# Features: age, blood pressure, cholesterol, exercise hours/week
age = np.random.uniform(25, 75, n)
blood_pressure = 80 + 0.5 * age + np.random.randn(n) * 15
cholesterol = 150 + 0.8 * age + np.random.randn(n) * 30
exercise = np.random.uniform(0, 10, n)
# Risk: higher age, BP, cholesterol increase risk; exercise decreases it
risk_score = (0.04 * age + 0.02 * blood_pressure + 0.01 * cholesterol
- 0.5 * exercise - 5 + np.random.randn(n) * 1.5)
high_risk = (risk_score > 0).astype(float)
X = np.column_stack([age, blood_pressure, cholesterol, exercise])
y = high_risk
print(f"Medical screening dataset: {n} patients")
print(f" High risk: {int(y.sum())} ({y.mean()*100:.0f}%)")
print(f" Low risk: {int(n - y.sum())} ({(1-y.mean())*100:.0f}%)")
# Split, scale, train -- the pipeline from episode #11
split = int(0.8 * n)
idx = np.random.permutation(n)
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_s = (X_train - mean) / std
X_test_s = (X_test - mean) / std
model_med = LogisticRegression(learning_rate=0.5, n_epochs=1000)
model_med.fit(X_train_s, y_train)
train_acc = np.mean(model_med.predict(X_train_s) == y_train)
test_acc = np.mean(model_med.predict(X_test_s) == y_test)
print(f"\nTraining accuracy: {train_acc:.1%}")
print(f"Test accuracy: {test_acc:.1%}")
print(f"Final loss: {model_med.history[-1]:.4f}")
# Which features matter most?
feature_names = ["age", "blood_pressure", "cholesterol", "exercise"]
w = model_med.weights
print(f"\nFeature importance (weight magnitude):")
for name, wi in zip(feature_names, w[:-1]):
direction = "increases" if wi > 0 else "decreases"
print(f" {name:>16s}: {wi:>+.3f} ({direction} risk)")
print(f" {'bias':>16s}: {w[-1]:>+.3f}")
Look at the weights. They tell a story: age, blood pressure, and cholesterol have positive weights (more = higher risk), while exercise has a negative weight (more = lower risk). That matches our data-generating process. The magnitude tells you relative importance -- which feature has the strongest effect on the prediction (comparable here because the features were standardized to the same scale). This interpretability is one of logistic regression's biggest strengths, same as we discussed for linear regression in episode #10. You can explain to a doctor: "the model says each extra year of age adds X to the risk score, and each extra hour of weekly exercise reduces it by Y."
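One practical detail before quoting numbers like "X per year of age": the weights above live in standardized feature space. To talk in original units, you can divide each weight by the training standard deviation of its feature. A minimal sketch, assuming the variables from the snippet above; note this gives the change in the raw score z, not in the probability itself.
# Convert weights from standardized space back to "per original unit" effects on z
print("Approximate effect on the score z per one unit of each original feature:")
for name, wi, si in zip(feature_names, w[:-1], std):
    print(f" {name:>16s}: {wi / si:>+.4f} per unit")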
Multiclass: a preview
Everything we've built today handles binary classification -- two classes. But many real problems have more than two categories. What type of news article is this? (Sports, politics, tech, entertainment, ...) What digit is written in this image? (0, 1, 2, ..., 9.) What species of flower? (Setosa, versicolor, virginica.)
There are two standard approaches for multiclass:
One-vs-Rest (OvR): Train K binary classifiers, one per class. Each classifier asks "is it this class, or everything else?" At prediction time, run all K classifiers and pick the one with the highest probability. Simple, works well, scales linearly with the number of classes.
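Here's what OvR could look like on top of the LogisticRegression class we just built. A minimal sketch, not a polished implementation -- it assumes a feature matrix X_multi and integer labels y_multi in 0..K-1:
def fit_one_vs_rest(X_multi, y_multi, n_classes):
    """Train one binary logistic regression per class."""
    models = []
    for k in range(n_classes):
        y_binary = (y_multi == k).astype(float)   # class k vs. everything else
        models.append(LogisticRegression(learning_rate=0.5, n_epochs=1000).fit(X_multi, y_binary))
    return models

def predict_one_vs_rest(models, X_multi):
    """Run every binary classifier and pick the class with the highest probability."""
    probas = np.column_stack([m.predict_proba(X_multi) for m in models])
    return probas.argmax(axis=1)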
Softmax regression: Generalize the sigmoid to K classes. Instead of squishing one number into (0,1), the softmax function takes K numbers and turns them into K probabilities that sum to 1. This is mathematically cleaner than OvR and is what neural networks use for their final layer in classification tasks.
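The softmax itself is only a few lines of NumPy. A quick sketch so you recognize it when it shows up again:
def softmax(z):
    """Turn K raw scores into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # roughly [0.659 0.242 0.099] -- sums to 1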
We'll encounter softmax properly when we get to neural networks. For now, the binary logistic regression we built today is the foundation. Every concept -- sigmoid activation, cross-entropy loss, gradient-based training, decision boundaries, threshold selection -- carries forward directly.
Let's recap
We made the leap from regression to classification today. Let me make sure every key concept is solid:
- Classification predicts categories (spam/not spam, pass/fail), not continuous numbers. The output should be a probability between 0 and 1;
- Linear regression fails for classification because its predictions go outside [0,1] and it's too sensitive to outliers;
- The sigmoid function 1 / (1 + e^(-z)) squishes any real number into the (0, 1) range -- perfect for probabilities. It's smooth and differentiable, so gradient descent works;
- Binary cross-entropy is the correct loss for classification. It penalizes confident wrong predictions logarithmically -- much harsher than MSE. This forces the model to be careful about confident predictions;
- The gradient is (1/m) * X.T @ (y_hat - y) -- strikingly similar to linear regression's gradient, but with sigmoid-transformed predictions. Always verify gradients numerically;
- The decision boundary is where the model is exactly 50% uncertain (sigmoid(z) = 0.5, meaning z = 0). It's a hyperplane in feature space, positioned by training to separate the classes;
- Threshold selection controls the tradeoff between false positives and false negatives. 0.5 is a reasonable default, but the optimal threshold depends on the relative cost of different errors;
- Logistic regression's weights are directly interpretable: positive weight means the feature pushes toward class 1, negative means it pushes toward class 0, and magnitude indicates importance;
- The .fit() / .predict() / .predict_proba() pattern extends naturally from what we built for linear regression. Same skeleton, different internals.
This was the last "from scratch" model we'll build in this initial run. We've now implemented both linear regression (predicting numbers) and logistic regression (predicting categories) from pure NumPy, understanding every formula and every line of code. But there's a question we've been sidestepping: how do you actually know if your model is any good? Accuracy is a start, but it can be deeply misleading. What if 95% of your data is one class? A model that always predicts the majority class gets 95% accuracy while being completely useless. We need better tools for measuring model quality -- and that's where we're headed next.
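If you want to see that trap concretely before the next episode, here's a tiny illustration with made-up labels:
# The accuracy trap: always predicting the majority class looks great on paper
y_imbalanced = np.array([0] * 95 + [1] * 5)       # 95% negative, 5% positive
always_majority = np.zeros_like(y_imbalanced)
print(f"Accuracy of 'always predict 0': {np.mean(always_majority == y_imbalanced):.0%}")
print(f"Positives actually caught: {np.sum((always_majority == 1) & (y_imbalanced == 1))}")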