Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
What will I learn
- You will learn the difference between bagging and boosting -- two fundamentally different ensemble strategies that start from opposite ends of the bias-variance spectrum;
- how AdaBoost reweights samples to focus on mistakes, building an ensemble of tiny decision stumps;
- how gradient boosting fits trees to residuals -- turning sequential weakness into strength, one correction at a time;
- the connection between residual fitting and gradient descent in function space (building on the training loop from episode #7);
- why XGBoost dominated machine learning competitions for nearly a decade;
- how LightGBM and CatBoost improve on the original idea with histogram binning and native categorical support;
- practical hyperparameter tuning and early stopping for gradient-boosted models;
- a head-to-head comparison of random forests versus gradient boosting on real data.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion (this post)
Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
At the end of episode #18 I teased what was coming next. We had just built a random forest -- hundreds of trees trained independently on bootstrap samples, voting together, errors canceling through diversity. And I said: "the one area where random forests consistently lose? Competition leaderboards." I mentioned that sequential ensemble methods -- methods where each tree learns from the mistakes of the ones before it -- typically squeeze out a few extra points of performance. Today we build those methods.
If random forests win through democracy (every tree gets an equal vote, no tree knows what the others decided), gradient boosting wins through coaching. Each new tree is a specialist, trained to fix exactly what the ensemble still gets wrong. The first tree makes a rough prediction. The second tree looks at the errors and tries to correct them. The third tree corrects the corrections. And so on, hundreds of times, until the residual error is tiny. It's the single most dominant approach for structured data competitions, and understanding it completes our tour of the classical ML algorithm families before we move into more exotic territory.
Let's dive right in.
Two philosophies of ensembles
In episode #18 we established that ensemble methods combine multiple weak models to create a strong one. But there are two fundamentally different strategies for HOW you combine them, and they come at the problem from opposite sides of the bias-variance tradeoff (which we first met in episode #11 with polynomial regression).
Bagging (what random forests do): build many trees independently on random subsets of data. Each tree is full-depth, high-variance, low-bias. It memorizes its training sample pretty well but wobbles a lot between different training samples. Combining many such trees reduces variance -- the wobbles average out -- while preserving the low bias. We saw this in action last episode.
Boosting: build trees sequentially. Each tree is shallow (a "stump" or small tree, often just max_depth=3), high-bias, low-variance. A shallow tree underfits -- it can't capture complex patterns alone. But each new tree specifically targets what the ensemble still gets wrong, systematically chipping away at the bias. Combining many sequential corrections reduces bias while keeping variance manageable.
Think of it this way. Bagging averages out noise. Boosting systematically eliminates errors. Both work, but boosting usually squeezes out more accuracy on structured data -- at the cost of being more sensitive to hyperparameters and more prone to overfitting if you're not careful.
There's a practical consequence of this difference that matters for your workflow. Remember how I said in episode #18 that you can throw 500 trees at a random forest, use default parameters, and get a reasonable result? With gradient boosting, that same negligence can lead to severe overfitting or underfitting. Boosting demands attention -- but it rewards careful tuning with better predictions. That's the tradeoff, and it's worth knowing upfront before we write a single line of code.
AdaBoost: the original booster
The first practical boosting algorithm was AdaBoost (Adaptive Boosting), published by Freund and Schapire in 1995. The idea is elegant: give each training sample a weight. Train a weak learner. Check which samples it gets wrong. Increase the weight of the misclassified samples, decrease the weight of the correct ones. Train the next weak learner on the reweighted data. Repeat.
The weak learner is typically a decision stump -- a tree with max_depth=1, meaning it can make exactly ONE split. One yes/no question about one feature. Barely better than random on its own. But the weighted combination of many such stumps can capture surprisingly complex decision boundaries.
Let me show you:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)
# Generate classification data with a nonlinear boundary
# (a circle -- linear models can't draw circles)
n = 400
X = np.random.randn(n, 2)
y = ((X[:, 0]**2 + X[:, 1]**2) > 1.5).astype(int)
# Add noise: flip 5% of labels
flip_idx = np.random.choice(n, size=int(0.05 * n), replace=False)
y[flip_idx] = 1 - y[flip_idx]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Data: {n} samples, 2 features, circular boundary")
print(f"Train: {len(y_train)} Test: {len(y_test)}")
The boundary here is a circle -- points inside the circle belong to class 0, points outside to class 1. No single decision stump can draw a circle. But let's see what happens when we chain many stumps together with AdaBoost's reweighting:
def adaboost(X_train, y_train, X_test, n_estimators=50):
    n = len(X_train)
    weights = np.ones(n) / n  # start with equal weights
    models, alphas = [], []
    for t in range(n_estimators):
        # Train a stump (max_depth=1) on the WEIGHTED data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X_train, y_train, sample_weight=weights)
        preds = stump.predict(X_train)
        # Which samples did we get wrong?
        incorrect = (preds != y_train)
        error = np.clip(
            np.dot(weights, incorrect) / weights.sum(),
            1e-10, 1 - 1e-10
        )
        # Alpha: how much influence this stump gets in the final vote
        # More accurate stump -> higher alpha
        alpha = 0.5 * np.log((1 - error) / error)
        # Reweight: misclassified get higher weight, correct get lower
        weights *= np.exp(alpha * np.where(incorrect, 1, -1))
        weights /= weights.sum()
        models.append(stump)
        alphas.append(alpha)
    # Final prediction: weighted vote across all stumps
    predictions = sum(
        a * np.where(m.predict(X_test) == 1, 1, -1)
        for m, a in zip(models, alphas)
    )
    return (predictions > 0).astype(int)
ada_preds = adaboost(X_train, y_train, X_test)
print(f"AdaBoost (50 stumps): test accuracy = "
f"{np.mean(ada_preds == y_test):.1%}")
Look at what's happening here. Each stump can only make ONE split -- one yes/no question. It's barely better than a coin flip for a circular boundary. But the alpha values ensure that more accurate stumps have more say in the final vote, and the sample reweighting ensures that later stumps focus on the data points the ensemble still struggles with. After 50 rounds of this, the combination of simple stumps approximates that circular boundary quite well.
The alpha formula deserves a closer look: alpha = 0.5 * log((1 - error) / error). When error is close to 0 (a good stump), alpha is large -- that stump gets lots of influence. When error is close to 0.5 (a useless stump), alpha is near 0 -- it's basically ignored. If error ever exceeds 0.5, alpha goes negative, which effectively inverts the stump's predictions (a stump that's wrong more than half the time contains useful signal if you just flip it). Mathematically clean ;-)
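Want to see the curve? Here's a tiny illustrative loop (the error values are made up for demonstration) that prints alpha across the range:
for error in [0.05, 0.25, 0.45, 0.50, 0.70]:
    # alpha formula from adaboost() above; values strictly between 0 and 1
    alpha = 0.5 * np.log((1 - error) / error)
    print(f"error={error:.2f} -> alpha={alpha:+.2f}")
You should see alpha shrink toward zero as error approaches 0.5, then flip sign beyond it.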
Gradient boosting: fitting residuals
AdaBoost works great for classification, but gradient boosting (Friedman, 2001) generalizes the boosting idea more elegantly. Instead of reweighting samples, it asks a simpler question: "what's the gap between the current prediction and the truth?" Then it trains the next tree to predict that gap -- the residual.
For regression, this is beautifully intuitive. Your ensemble currently predicts a value. The truth is different. The difference is the residual. Train a new tree to predict those residuals. Add its predictions (scaled by a small learning rate) to the ensemble. Repeat.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Regression data with a complex pattern
np.random.seed(42)
X_reg = np.sort(np.random.uniform(0, 10, 200)).reshape(-1, 1)
y_reg = (np.sin(X_reg.ravel())
         + 0.5 * np.cos(3 * X_reg.ravel())
         + np.random.randn(200) * 0.2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)
def gradient_boosting_regression(X_train, y_train, X_test,
                                 n_trees=100, learning_rate=0.1,
                                 max_depth=3):
    # Start with the simplest possible prediction: the mean
    initial_pred = y_train.mean()
    train_preds = np.full(len(X_train), initial_pred)
    test_preds = np.full(len(X_test), initial_pred)
    trees = []
    for i in range(n_trees):
        # Residuals = what we STILL get wrong
        residuals = y_train - train_preds
        # Fit a small tree to the residuals (not the original targets!)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X_train, residuals)
        # Update predictions with a small step
        train_preds += learning_rate * tree.predict(X_train)
        test_preds += learning_rate * tree.predict(X_test)
        trees.append(tree)
    return test_preds, trees
gb_preds, trees = gradient_boosting_regression(X_tr, y_tr, X_te)
mse = mean_squared_error(y_te, gb_preds)
print(f"Gradient Boosting MSE: {mse:.4f}")
print(f"Gradient Boosting R-squared: {1 - mse / np.var(y_te):.3f}")
Step through it mentally. The first tree tries to predict the residuals from the mean -- basically "how far off is the mean from the truth?" That's a lot of leftover signal, so the first tree captures major patterns. The second tree fits the residuals of THAT correction -- smaller, more specific errors. The third tree corrects the corrections. Each successive tree tackles a smaller remaining error. After 100 trees, the combined predictions closely track the complex sin + cos pattern in our data.
The learning rate (0.1 here) is crucial. It means each tree only contributes 10% of its correction. Why not use the full correction? Because each tree's correction is imperfect -- it's fitted on noisy data. Taking small, cautious steps and letting many trees contribute reduces the chance that any single tree's noise gets amplified. It's regularization through patience, and it's the single most important concept in gradient boosting.
Why "gradient" boosting?
Here's where things connect back to episode #7 and the training loop we built for gradient descent. Fitting trees to residuals for MSE loss IS gradient descent -- but in function space rather than parameter space.
Remember from episode #9, the derivative of MSE with respect to each prediction:
Loss = (1/n) * sum( (y_i - y_hat_i)^2 )
dLoss/dy_hat_i = -2(y_i - y_hat_i) / n
The negative gradient of the loss with respect to the prediction is proportional to (y_i - y_hat_i) -- which is exactly the residual. So when we train a tree to predict residuals, we're training it to predict the negative gradient of the loss function. Each tree steps in the direction of steepest loss reduction. That's gradient descent, but instead of adjusting numeric parameters (like we did with weights in episodes #7 and #10), we're adding entire TREES to our model.
This insight is what makes the framework so general. Change the loss function, and you get a different gradient. Binary cross-entropy for classification (remember episode #12?). Absolute error for outlier-robust regression. Quantile loss for prediction intervals. Huber loss for a smooth transition between squared and absolute error. The algorithm handles them all -- just compute the gradient, fit a tree to it, take a step. Same loop, different loss, different gradient, different behavior.
That generality is why it's called "gradient" boosting. It's boosting via gradient descent in function space.
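To see that generality in code, here's a minimal sketch with the negative gradient factored out as a plug-in function, reusing the regression split from above. The helper names (neg_gradient_mse, neg_gradient_mae, boost) are mine, and real implementations also re-optimize each tree's leaf values for the chosen loss -- this sketch skips that step:
# Negative gradients for two losses: what the next tree gets fitted to
def neg_gradient_mse(y, pred):
    return y - pred            # the residual

def neg_gradient_mae(y, pred):
    return np.sign(y - pred)   # sign of the residual

def boost(X, y, neg_gradient, n_trees=100, learning_rate=0.1):
    pred = np.full(len(y), y.mean())
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, neg_gradient(y, pred))  # same loop, different gradient
        pred += learning_rate * tree.predict(X)
    return pred

for name, grad in [("squared error", neg_gradient_mse),
                   ("absolute error", neg_gradient_mae)]:
    preds = boost(X_tr, y_tr, grad)
    print(f"{name}: train MAE = {np.mean(np.abs(y_tr - preds)):.4f}")
Swap in a different neg_gradient and the rest of the loop doesn't change -- that's the whole point.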
The learning rate and overfitting
The learning rate (also called shrinkage) is the single most important hyperparameter in gradient boosting, and it interacts tightly with the number of trees. Let me demonstrate the tradeoff:
from sklearn.ensemble import GradientBoostingRegressor
# Try different learning rate / n_estimators combinations
print(f"{'lr':>6s} {'trees':>6s} {'Train MSE':>10s} {'Test MSE':>10s}")
print("-" * 38)
for lr, n_est in [(1.0, 10), (0.5, 50), (0.1, 100),
                  (0.1, 300), (0.05, 500), (0.01, 500)]:
    gb = GradientBoostingRegressor(
        learning_rate=lr, n_estimators=n_est,
        max_depth=3, random_state=42
    )
    gb.fit(X_tr, y_tr)
    tr_mse = mean_squared_error(y_tr, gb.predict(X_tr))
    te_mse = mean_squared_error(y_te, gb.predict(X_te))
    print(f"{lr:>6.2f} {n_est:>6d} {tr_mse:>10.4f} {te_mse:>10.4f}")
A clear pattern should emerge. High learning rate (1.0) with few trees: the model converges fast but overfits -- the training error is tiny but the test error is bad. Low learning rate (0.01) with too few trees: the model underfits because it hasn't had enough rounds to converge. The sweet spot is in between: a moderate learning rate (0.05-0.3) with enough trees (100-1000) to let it converge without overfitting.
This is the fundamental tradeoff: smaller learning rate + more trees = more computation but better generalization. It's the same intuition as our gradient descent experiments in episode #7, where smaller step sizes led to smoother convergence. The difference is that here, each "step" is an entire decision tree, not a single parameter update. More trees also means slower training, and since gradient boosting builds trees sequentially (not in parallel like random forests), training time scales linearly with the tree count. That's a real practical cost.
Sklearn's GradientBoostingClassifier
Let's use scikit-learn's implementation (same consistent API from episode #16) on our classification data:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
gb_clf = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    min_samples_leaf=5,
    subsample=0.8,  # stochastic gradient boosting
    random_state=42
)
gb_clf.fit(X_train, y_train)
print(f"GB Classifier: train={gb_clf.score(X_train, y_train):.1%} "
f"test={gb_clf.score(X_test, y_test):.1%}")
cv = cross_val_score(gb_clf, X, y, cv=5)
print(f"CV accuracy: {cv.mean():.3f} +/- {cv.std():.3f}")
Same fit/predict/score pattern as every other sklearn model from episode #16. Same cross_val_score. You could swap in RandomForestClassifier by changing one line and compare -- that's the power of the consistent API.
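Here's what that swap looks like -- same data, same cross_val_score, one different class:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
cv_rf = cross_val_score(rf_clf, X, y, cv=5)
print(f"RF CV accuracy: {cv_rf.mean():.3f} +/- {cv_rf.std():.3f}")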
Notice the subsample=0.8 parameter. This is called stochastic gradient boosting -- each tree trains on a random 80% of the data. Sounds familiar, right? It's adding the same kind of randomness that bootstrap sampling gives random forests, on top of the sequential error-correction strategy of boosting. Best of both worlds: the systematic bias-reduction of boosting plus the variance-reduction of random sampling. Friedman proposed this in 1999 (inspired by Breiman's work on bagging), and it almost always helps.
XGBoost: the competition king
XGBoost (eXtreme Gradient Boosting, by Tianqi Chen, 2014) took gradient boosting and engineered it for maximum performance. It added several key innovations beyond sklearn's implementation:
- L1 and L2 regularization on leaf weights -- penalizing large predictions in any single leaf, preventing overfitting (similar in spirit to Ridge and Lasso from episode #11, but applied to tree outputs)
- A smarter tree-building algorithm that considers both information gain AND tree complexity in a single objective function
- Built-in handling of missing values -- the tree learns which direction to send missing values at each split, no imputation needed
- Parallel computation during tree construction -- while the trees themselves are built sequentially, finding the best split within each tree is parallelized across features
- Column subsampling (colsample_bytree) -- similar to random forests' max_features, adding another layer of randomness
from xgboost import XGBClassifier # pip install xgboost
xgb = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    reg_alpha=0.1,    # L1 regularization
    reg_lambda=1.0,   # L2 regularization
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    verbosity=0
)
xgb.fit(X_train, y_train)
print(f"XGBoost: train={xgb.score(X_train, y_train):.1%} "
f"test={xgb.score(X_test, y_test):.1%}")
Between 2014 and roughly 2020, XGBoost won the majority of tabular data competitions on Kaggle. It wasn't unusual to see the top ten solutions for a structured-data challenge ALL using XGBoost with different feature engineering strategies. "Just throw XGBoost at it" became a running joke in the ML community -- except the joke worked because the advice was genuinely good ;-)
The library's dominance was so thorough that it spawned an entire ecosystem of gradient boosting frameworks, each trying to improve on specific aspects. Two of those successors are now serious contenders in their own right.
LightGBM and CatBoost: the newer contenders
XGBoost proved the concept. Two later frameworks pushed specific aspects further.
LightGBM (Microsoft, 2017) solved the speed problem. Standard gradient boosting evaluates every possible split value for every feature at every node. For a feature with 100,000 unique values, that's 100,000 threshold evaluations per split candidate. LightGBM introduced histogram-based splitting: it bins continuous features into 255 discrete buckets before training starts. Finding the best split becomes scanning 255 bins instead of 100,000 values. The speed improvement on large datasets is dramatic -- 10-20x faster, while often matching or slightly exceeding XGBoost accuracy.
LightGBM also changed how trees grow. XGBoost and sklearn's GradientBoostingClassifier grow trees level-wise -- completing one full depth level before starting the next. LightGBM grows trees leaf-wise, always splitting the leaf that reduces the loss the most, regardless of depth. This reaches lower training loss faster but can overfit more aggressively. That's why num_leaves (not max_depth) is the primary regularization knob for LightGBM. A common starting point is num_leaves=31 with max_depth=-1 (unlimited). The API follows the sklearn pattern: LGBMClassifier and LGBMRegressor with the same fit/predict interface.
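A minimal sketch, reusing our classification split from earlier (assuming pip install lightgbm):
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31,   # the primary complexity knob for leaf-wise growth
    max_depth=-1,    # unlimited depth; num_leaves does the regularizing
    random_state=42,
    verbose=-1
)
lgbm.fit(X_train, y_train)
print(f"LightGBM: test={lgbm.score(X_test, y_test):.1%}")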
CatBoost (Yandex, 2017) solved a different problem: categorical features. If you have a column like "country" or "product_category" with string values, XGBoost requires you to manually encode it first (one-hot encoding from episode #14, label encoding, or target encoding from episode #15). CatBoost handles categories natively using ordered target statistics -- a technique that computes target-based encodings using only preceding samples (in a random permutation order), avoiding the target leakage that makes naive target encoding dangerous. If your data has many categorical features, CatBoost often wins without any preprocessing whatsoever.
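And a minimal sketch of that native handling (assuming pip install catboost; the toy DataFrame and its columns are invented for illustration):
import pandas as pd
from catboost import CatBoostClassifier

# Tiny synthetic frame with a raw string category -- no encoding step
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "sqm": rng.uniform(30, 150, 200),
    "country": rng.choice(["NL", "DE", "BE"], 200),
})
target = (df["sqm"] > 90).astype(int)

cat = CatBoostClassifier(iterations=100, verbose=0, random_seed=42)
cat.fit(df, target, cat_features=["country"])  # declare the raw category
print(f"CatBoost train accuracy: {cat.score(df, target):.1%}")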
Both libraries install with pip and follow the familiar sklearn API pattern. In practice, the choice between the three often comes down to your data:
- LightGBM for large datasets where training speed matters
- CatBoost for categorical-heavy data where you want to skip encoding
- XGBoost as the safe default with the most mature ecosystem and documentation
Tuning gradient boosting: a staged approach
Gradient boosting has more knobs than random forests (where you could mostly get away with defaults, as we saw in episode #18). But a systematic approach works well. Rather than grid-searching a massive parameter space all at once (which we could do with GridSearchCV from episode #16, but it would take ages), I recommend tuning in stages.
The tuning priority, from most to least impactful:
- learning_rate + n_estimators -- these interact. Lower rate needs more trees. Start with 0.1 and 100-300 trees.
- max_depth -- keep trees shallow. 3-6 is the sweet spot. Deep trees overfit.
- subsample -- 0.7-0.9 usually helps. Below 0.5 tends to hurt.
- min_samples_leaf -- increase if overfitting persists after tuning depth.
- Regularization (XGBoost/LightGBM specific) -- reg_alpha and reg_lambda penalize complex trees.
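To make stage one concrete, here's a small sketch with GridSearchCV from episode #16 -- the grid values are reasonable starting points, nothing more:
from sklearn.model_selection import GridSearchCV

stage1 = GridSearchCV(
    GradientBoostingClassifier(max_depth=3, random_state=42),
    param_grid={"learning_rate": [0.05, 0.1, 0.3],
                "n_estimators": [100, 200, 300]},
    cv=5
)
stage1.fit(X_train, y_train)
print(f"Stage 1 best: {stage1.best_params_}")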
But there's an even smarter approach than grid-searching n_estimators: early stopping. Set a large number of trees (1000+), start training, and monitor performance on a validation set. When the validation performance stops improving for a given number of consecutive rounds, stop training. The model tells you how many trees it needs instead of you guessing.
xgb_early = XGBClassifier(
    learning_rate=0.05,
    n_estimators=1000,  # set high -- early stopping will cut it short
    max_depth=4,
    subsample=0.8,
    random_state=42,
    verbosity=0,
    early_stopping_rounds=20
)
)
xgb_early.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
print(f"Stopped at tree {xgb_early.best_iteration}")
print(f"Test accuracy: {xgb_early.score(X_test, y_test):.3f}")
Early stopping monitors the eval_set and halts training when performance hasn't improved for 20 consecutive rounds. This is more efficient than grid-searching n_estimators -- you set it once, high enough to never be the bottleneck, and let the data tell you when to stop. The resulting model uses only the trees up to the best iteration, not all 1000.
One practical note: the eval_set here should ideally be a validation set, not your test set. Using the test set for early stopping technically leaks information (you're making decisions based on test performance). For a rigorous workflow, split your data into train/validation/test. But for quick experiments and learning purposes, this is fine.
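A leak-free version might look like this (a sketch; the _fit/_val split names are mine):
# Carve a validation set out of the training data for early stopping
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
xgb_early.fit(
    X_fit, y_fit,
    eval_set=[(X_val, y_val)],
    verbose=False
)
# The untouched test set now gives an honest final score
print(f"Test accuracy: {xgb_early.score(X_test, y_test):.3f}")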
The full comparison: random forests vs gradient boosting
Now that we've covered both approaches over episodes #18 and #19, let's put them head-to-head on the same dataset with the same evaluation framework. We'll include a few other models from earlier episodes for context:
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Generate data with nonlinear patterns and interactions
np.random.seed(42)
n = 500
X_full = np.random.randn(n, 6)
y_full = ((X_full[:, 0]**2
           + 0.5 * X_full[:, 1]
           - X_full[:, 2]
           + X_full[:, 3] * X_full[:, 4]
           + 0.3 * X_full[:, 5]**2
           + np.random.randn(n) * 0.5) > 0.5).astype(int)
models = {
    "Logistic Regression": Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(max_iter=1000)),
    ]),
    "Decision Tree (d=5)": DecisionTreeClassifier(
        max_depth=5, random_state=42
    ),
    "Random Forest (200)": RandomForestClassifier(
        n_estimators=200, random_state=42
    ),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1,
        max_depth=3, random_state=42
    ),
}
print(f"{'Model':>25s} {'CV Accuracy':>12s} {'CV F1':>10s}")
print("-" * 51)
for name, model in models.items():
    acc = cross_val_score(model, X_full, y_full,
                          cv=5, scoring='accuracy')
    f1 = cross_val_score(model, X_full, y_full,
                         cv=5, scoring='f1')
    print(f"{name:>25s} {acc.mean():>8.3f} +/- {acc.std():.3f} "
          f"{f1.mean():>6.3f}")
On this data (which has genuine nonlinear patterns -- squared terms, an interaction between features 3 and 4), the linear model should lag behind the tree-based methods. The single decision tree should do OK but be unstable. The random forest should be strong and stable. And gradient boosting should match or slightly beat the random forest -- especially if we took the time to tune it properly.
That last point is important. Out-of-the-box, random forests and gradient boosting are often very close. The extra accuracy from gradient boosting typically comes from careful tuning (learning rate, depth, early stopping). If you don't have time to tune, random forests are the safer bet. If you're optimizing for every fraction of a percent, gradient boosting is where that optimization pays off.
Feature importance with gradient boosting
Just like random forests (episode #18), gradient-boosted trees provide feature importances. Let's look at what the model learned:
from sklearn.ensemble import GradientBoostingClassifier
gb_full = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1,
    max_depth=3, random_state=42
)
gb_full.fit(X_full, y_full)
feature_names = [f"feat_{i}" for i in range(X_full.shape[1])]
importances = gb_full.feature_importances_
print("Gradient Boosting -- feature importances:")
for name, imp in sorted(zip(feature_names, importances),
                        key=lambda x: -x[1]):
    bar = "#" * int(imp * 50)
    print(f"  {name}: {imp:.3f} {bar}")
The boosted ensemble should rank features 0, 3, and 4 as most important (they drive the nonlinear boundary: feat_0^2 and the feat_3 * feat_4 interaction). This is the same kind of automated feature selection we discussed in episodes #15 and #18 -- the model tells you which inputs matter, averaged across hundreds of sequential trees.
For even more robust importance measures, you can use permutation importance (the technique we built from scratch in episode #15 and applied with sklearn in episode #18). Permutation importance shuffles each feature independently and measures how much accuracy drops -- if shuffling a feature destroys performance, it was important. And if you want to understand individual predictions -- not just global importance -- there's a framework called SHAP that builds on game theory to attribute each feature's contribution to each specific prediction. We won't go into SHAP today, but it's good to know it exists and is particularly popular with gradient-boosted models in production settings.
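A quick sketch of the sklearn version, reusing gb_full and feature_names from the block above:
from sklearn.inspection import permutation_importance

perm = permutation_importance(gb_full, X_full, y_full,
                              n_repeats=10, random_state=42)
for name, drop in sorted(zip(feature_names, perm.importances_mean),
                         key=lambda x: -x[1]):
    print(f"  {name}: accuracy drop {drop:.3f}")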
When to use which: a practical guide
Now that we have the full picture -- linear models (episodes #10-12), decision trees (#17), random forests (#18), and gradient boosting (#19) -- here's my honest practical guide for tabular data:
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel (fast) | Sequential (slower) |
| Overfitting risk | Rarely overfits | Can overfit without tuning |
| Hyperparameters | Few, forgiving | Many, sensitive |
| Accuracy ceiling | Good | Higher (usually) |
| Missing values | Needs handling | XGBoost handles natively |
| Categorical data | Needs encoding | CatBoost handles natively |
| Feature scaling | Not needed | Not needed |
My workflow on a new problem: start with a random forest (200 trees, default parameters) to get a solid baseline in minutes. If that baseline is good enough for the task at hand, ship it. If I need more accuracy and have time to tune, switch to gradient boosting with early stopping. For competition-level optimization, use XGBoost or LightGBM with staged tuning (learning rate first, then depth, then regularization).
One more thing worth emphasizing: both approaches work on tabular (structured) data -- rows and columns, like a spreadsheet or a database table. Sales records, sensor readings, customer profiles, financial transactions. For images, text, and audio, the models we'll build later in this series (neural networks, convolutions, transformers) are the clear winners. But for the kind of data that most businesses actually work with day to day, gradient-boosted trees remain the state of the art. That's not going to change anytime soon.
A complete real-world-ish workflow
Let me tie everything together with our running apartment dataset. We'll build a full gradient boosting pipeline using the same data patterns from episodes #15, #17, and #18 -- including the floor-elevator interaction that linear models needed manual feature engineering to capture:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score)
# Same apartment data from episodes #15, #17, #18
np.random.seed(42)
n = 500
sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
age = np.random.uniform(0, 50, n)
floor = np.random.randint(0, 10, n).astype(float)
has_elevator = np.random.randint(0, 2, n).astype(float)
price = (2500 * sqm
         + 800 * rooms
         - 300 * age
         - 4000 * floor * (1 - has_elevator)
         + 20 * sqm * (50 - age) / 50
         + np.random.randn(n) * 12000)
X_apt = np.column_stack([sqm, rooms, age, floor, has_elevator])
apt_features = ["sqm", "rooms", "age", "floor", "elevator"]
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_apt, price, test_size=0.2, random_state=42
)
# Gradient Boosting: no scaling, no feature engineering
gb_apt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    min_samples_leaf=5,
    random_state=42
)
gb_apt.fit(X_tr_a, y_tr_a)
pred_te = gb_apt.predict(X_te_a)
r2 = r2_score(y_te_a, pred_te)
rmse = np.sqrt(mean_squared_error(y_te_a, pred_te))
mae = mean_absolute_error(y_te_a, pred_te)
print("=== Apartment Price -- Gradient Boosting ===\n")
print(f"Test R-squared: {r2:.4f}")
print(f"Test RMSE: EUR {rmse:,.0f}")
print(f"Test MAE: EUR {mae:,.0f}")
# What did it learn about feature importance?
print("\nFeature importances:")
for name, imp in sorted(
    zip(apt_features, gb_apt.feature_importances_),
    key=lambda x: -x[1]
):
    bar = "#" * int(imp * 40)
    print(f"  {name:>10s}: {imp:.3f} {bar}")
Same raw features, no interaction engineering, no StandardScaler. Just like the random forest in episode #18, the gradient boosting model discovers the floor * elevator interaction and the sqm * age interaction automatically. But the sequential correction mechanism lets it fit those patterns more precisely -- each new tree specifically targets the remaining prediction errors, gradually zeroing in on the true relationship.
Compare this to episode #15, where we had to manually create floor * (1 - has_elevator) and sqm * age as new columns for the linear model to have any chance of capturing those patterns. The tree-based ensembles do this work for you.
Let's recap
We've come a long way in three episodes. In #17 we built a single decision tree from scratch. In #18 we combined many trees independently via bagging and created random forests. Today in #19 we flipped the strategy and built trees sequentially, each one correcting the mistakes of the ones before it. Here's the full picture:
- Boosting builds trees sequentially, each one targeting the residual errors of the ensemble so far -- the opposite strategy from bagging's independent parallel trees. It reduces bias where random forests reduce variance;
- AdaBoost reweights samples (misclassified samples get higher weight so the next stump focuses on them). Gradient boosting generalizes this by fitting trees to the negative gradient of any differentiable loss function;
- The learning rate controls how much each tree contributes. Smaller values need more trees but generalize better -- the same patience-vs-convergence tradeoff from our gradient descent work in episode #7;
- XGBoost, LightGBM, and CatBoost are optimized implementations with key innovations: regularization on leaf weights, histogram-based splitting for speed, leaf-wise growth, and native categorical feature handling;
- Tuning priority: learning rate and n_estimators first (they interact), then max_depth, then subsample, then regularization. Early stopping with a validation set is the most practical way to find the right number of trees;
- Gradient boosting usually beats random forests on raw accuracy for tabular data, but demands more careful tuning. Random forests are the safer default; gradient boosting is what you reach for when you need every fraction of a percent;
- Both are tree-based ensembles for structured data. For images, text, and audio, we'll need fundamentally different architectures -- neural networks, which we'll build from scratch later in this series.
We've now covered the entire classical supervised learning toolkit: linear models, decision trees, random forests, gradient boosting. In the next episodes we'll explore models that take yet another approach to finding patterns -- ones that think about boundaries and distances instead of trees and splits. The toolkit keeps growing ;-)