Learn AI Series (#18) - Random Forests - Wisdom of Crowds

What will I learn

You will learn why combining many weak models creates a strong one -- the ensemble principle and the math that makes it work;
bagging (bootstrap aggregating) -- how to build diverse trees from random samples of your data;
random feature selection -- the trick that makes random forests dramatically more powerful than plain bagging;
out-of-bag evaluation -- free validation without needing a separate holdout set;
how many trees you actually need (and why adding more can never overfit);
feature importance -- letting the forest tell you which inputs drive predictions;
random forests for both classification and regression, with scikit-learn;
hyperparameter tuning with the Pipeline and GridSearchCV tools from episode #16;
when random forests are all you need and when they fall short.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#18) - Random Forests - Wisdom of Crowds

At the end of episode #17 I was very upfront about something. We had just built a complete decision tree classifier from scratch -- Gini impurity, information gain, recursive splitting, prediction, visualization, the whole thing. It worked. It was interpretable. You could print the learned logic as human-readable if/else rules. And then I said: "a single tree is rarely the best model for any task." That probably stung a little, right after building one by hand ;-)

The reason is the instability problem. Change a few training samples and the tree can look completely different -- different features at the root, different thresholds, different structure entirely. We saw that unlimited trees memorize training data (100% training accuracy, mediocre test accuracy), and pruning helps but always feels like a compromise between capacity and generalization. The bias-variance tradeoff we've been wrestling with since episode #11.

Today we solve that problem. Not by building a better tree, but by building a forest of trees. Hundreds of them. Each one slightly different. Each one imperfect. And when they vote together on a prediction, the result is remarkably robust and accurate. The individual errors cancel out. It's one of those ideas that sounds almost too good to be true -- and yet it's backed by solid math and decades of empirical success.

In 1906, a statistician named Francis Galton observed something remarkable at a county fair. Visitors guessed the weight of an ox. Individually, their guesses were all over the place -- some wildly high, some absurdly low. But the median of all guesses was within 1% of the actual weight. The crowd, collectively, was smarter than any individual expert. Random forests apply exactly this principle to decision trees. Let's see how.

Why one tree isn't enough

Back in episode #17, we built a decision tree on apartment data and watched what happened with different max_depth settings. Unlimited depth: perfect training accuracy, poor test accuracy. Shallow pruning: worse training accuracy, but better generalization. The sweet spot was somewhere in the middle, and cross_val_score from episode #16 helped us find it.

But even the tuned single tree has a fundamental weakness: it's brittle. Let me show you what I mean.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Generate classification data with nonlinear patterns
n = 300
X = np.random.randn(n, 5)
y = ((X[:, 0] + 0.5 * X[:, 1] - X[:, 2] + X[:, 3] * X[:, 4]
      + np.random.randn(n) * 0.8) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a single tree
single_tree = DecisionTreeClassifier(max_depth=5, random_state=42)
single_tree.fit(X_train, y_train)
print(f"Single tree:  train={single_tree.score(X_train, y_train):.1%}  "
      f"test={single_tree.score(X_test, y_test):.1%}")

# Now train on a SLIGHTLY different training set
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=99
)

tree2 = DecisionTreeClassifier(max_depth=5, random_state=42)
tree2.fit(X_train2, y_train2)
print(f"Same algo, different split: "
      f"train={tree2.score(X_train2, y_train2):.1%}  "
      f"test={tree2.score(X_test2, y_test2):.1%}")

# How many predictions differ between the two trees?
shared_preds_1 = single_tree.predict(X)
shared_preds_2 = tree2.predict(X)
disagree = np.mean(shared_preds_1 != shared_preds_2)
print(f"\nPrediction disagreement on full dataset: {disagree:.1%}")

Same algorithm. Same hyperparameters. Slightly different training data. And the two trees disagree on a significant fraction of predictions. This is the high variance problem we first discussed in episode #11 when we saw high-degree polynomials wobble between data points. A single tree is sensitive to exactly which samples ended up in its training set. The specific noise in THIS sample pulls the splits one way; different noise pulls them another way.

What if, in stead of trying to build one perfect tree, we built many imperfect trees and let them vote?

# Many trees, each on a random subset of data
n_trees = 100
predictions = np.zeros((len(X_test), n_trees))

for i in range(n_trees):
    # Bootstrap sample: draw n samples WITH replacement
    boot_idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
    X_boot = X_train[boot_idx]
    y_boot = y_train[boot_idx]

    tree = DecisionTreeClassifier(max_depth=5, random_state=i)
    tree.fit(X_boot, y_boot)
    predictions[:, i] = tree.predict(X_test)

# Majority vote
ensemble_preds = (predictions.mean(axis=1) > 0.5).astype(int)
ensemble_acc = np.mean(ensemble_preds == y_test)
print(f"100 trees voting: test={ensemble_acc:.1%}")

The ensemble almost certainly outperforms the single tree on test data. Each individual tree overfits to its own bootstrap sample, but they overfit in different ways. Tree #7 might be wrong about sample #42, but trees #1, #3, #15, #28 and #56 get it right -- and the majority vote wins. The individual noise cancels out through aggregation. Same principle as Galton's ox-weight experiment. That's the core insight of ensemble methods, and it carries us all the way through this episode.

Bootstrap aggregating (bagging)

What we just did has a name: bagging (bootstrap aggregating), coined by Leo Breiman in 1996. The two key ingredients:

Bootstrap sampling: draw n samples from the training set with replacement. Each tree sees a slightly different dataset -- some samples appear multiple times, others don't appear at all. Roughly 63% of the original samples appear in each bootstrap sample, and ~37% are left out.
Aggregating: combine the predictions by majority vote (classification) or averaging (regression).

Let me verify that 63% number, because it's not obvious why it's 63% and not, say, 50% or 80%.

# Verify the ~63% coverage of bootstrap sampling
n_samples = 240  # roughly len(X_train)

coverages = []
for trial in range(1000):
    boot_idx = np.random.choice(n_samples, size=n_samples, replace=True)
    unique_fraction = len(np.unique(boot_idx)) / n_samples
    coverages.append(unique_fraction)

avg_coverage = np.mean(coverages)
print(f"Average unique fraction over 1000 trials: {avg_coverage:.3f}")
print(f"Theoretical value (1 - 1/e):              {1 - 1/np.e:.3f}")

The math: the probability of a specific sample NOT being picked in any single draw is (1 - 1/n). Over n independent draws, that probability is (1 - 1/n)^n. As n grows, this converges to 1/e ~ 0.368. So each tree misses about 36.8% of the data and sees about 63.2%. This is a fundamental result from probability (you might recognize it from episode #9 if you remember the exponential limit). The exact same math shows up in a completely different context here -- and that ~37% of unseen data will become very useful in a moment.

Having said that, bagging alone helps, but there's a problem. If your dataset has one very strong feature, every single tree will split on it first. The trees end up correlated -- they make similar decisions, they make similar mistakes, and the voting doesn't cancel errors as effectively as you'd want. The trees are diverse in terms of which samples they see, but not diverse in terms of which features they use.

Random feature selection: the key innovation

This is where Leo Breiman's stroke of genius comes in. Random forests add a second source of randomness on top of bootstrap sampling: at each split, the tree only considers a random subset of features.

If you have 10 features, a standard decision tree examines all 10 at every split and picks the best. A random forest tree might only look at 3 random features at each split (typically sqrt(n_features) for classification, n_features/3 for regression). If the best feature globally is feature #0, some trees won't even have the option to split on it at certain nodes. They're forced to use feature #3 or feature #7 or feature #9 in stead. This forces the trees to learn different aspects of the data.

# Manually build a random forest with feature subsampling
def manual_random_forest(X_train, y_train, X_test,
                         n_trees=100, max_features='sqrt'):
    n_features = X_train.shape[1]

    if max_features == 'sqrt':
        n_select = int(np.sqrt(n_features))
    elif max_features == 'log2':
        n_select = int(np.log2(n_features))
    else:
        n_select = n_features

    all_predictions = []

    for i in range(n_trees):
        # Bootstrap sample
        boot_idx = np.random.choice(
            len(X_train), size=len(X_train), replace=True
        )

        # sklearn handles max_features internally during tree construction,
        # not by subsetting columns beforehand. At each split node, it
        # randomly selects n_select features and only considers those.
        tree = DecisionTreeClassifier(
            max_features=n_select,
            random_state=i
        )
        tree.fit(X_train[boot_idx], y_train[boot_idx])
        all_predictions.append(tree.predict(X_test))

    # Majority vote
    preds = np.array(all_predictions)
    return (preds.mean(axis=0) > 0.5).astype(int)

# Compare: bagging (all features) vs random forest (sqrt features)
bag_preds = manual_random_forest(
    X_train, y_train, X_test,
    n_trees=200, max_features=5  # all 5 features
)
rf_preds = manual_random_forest(
    X_train, y_train, X_test,
    n_trees=200, max_features='sqrt'  # sqrt(5) ~ 2 features per split
)

bag_acc = np.mean(bag_preds == y_test)
rf_acc = np.mean(rf_preds == y_test)

print(f"Bagging (all features):       test={bag_acc:.1%}")
print(f"Random forest (sqrt features): test={rf_acc:.1%}")

With 5 features, sqrt(5) ~ 2, so each split considers only 2 random features out of 5. This forces real diversity -- one tree might split on features 0 and 3 at a given node, another on features 1 and 4. They discover different patterns in the data. And because they're diverse, their errors are less correlated, which means the majority vote is more effective at canceling noise.

This is the full random forest algorithm. Two layers of randomness: (1) each tree trains on a random bootstrap sample of the data, and (2) each split within each tree considers a random subset of features. Together, they produce a collection of diverse, decorrelated trees whose combined predictions are far more accurate and stable than any individual tree.

Random forests with scikit-learn

Of course, scikit-learn wraps all of this into the same clean API we set up in episode #16. Same fit/predict/score pattern. Same pipeline compatibility. You know the drill by now ;-)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(f"Random Forest: train={rf.score(X_train, y_train):.1%}  "
      f"test={rf.score(X_test, y_test):.1%}")

# Cross-validation for a robust estimate (episode #13 & #16)
cv_scores = cross_val_score(rf, X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

Notice how little code that is. One line to create the model, one line to train, one line to evaluate. Same interface as DecisionTreeClassifier from episode #17, same interface as LogisticRegression from episode #16, same interface as Ridge from episode #16. That's the beauty of sklearn's consistent API -- you switch algorithms by changing the class name. Everything else stays the same.

Let me walk you through the key hyperparameters, because these are the knobs you'll be tuning:

n_estimators: number of trees in the forest. More is better (with diminishing returns). Default: 100.
max_depth: maximum depth per tree. None means unlimited (each tree grows until pure or min_samples is hit).
max_features: features to consider at each split. 'sqrt' for classification (default), 1.0 or n/3 for regression. This is THE random forest hyperparameter -- it controls the decorrelation between trees.
min_samples_leaf: minimum samples per leaf node. Higher values create simpler trees.
min_samples_split: minimum samples to attempt a split. Same concept as in episode #17.
n_jobs: number of CPU cores to use for parallel training. Set to -1 to use all cores.

That last one is important and worth highlighting. Each tree in a random forest is completely independent from every other tree. Tree #1 doesn't need to wait for tree #99 to finish. This means random forests are embarrassingly parallel -- you can train 100 trees on 8 CPU cores and get close to an 8x speedup. Try that with gradient descent (episode #7) where each step depends on the previous one. Trees just work in parallel. This is a massive practical advantage for large datasets.

How many trees do you actually need?

One of the nicest properties of random forests is that more trees never hurts. Unlike polynomial regression (episode #11) where adding complexity leads to overfitting, adding more trees to a random forest just makes the vote more stable. Each new tree is trained on a different bootstrap sample with different random feature subsets -- it brings new information to the ensemble without memorizing the same noise harder.

But there are diminishing returns. Let's see the convergence:

print(f"{'Trees':>6s}  {'Test Accuracy':>14s}  {'CV Accuracy':>14s}")
print("-" * 38)

for n_trees in [1, 5, 10, 25, 50, 100, 200, 500]:
    rf = RandomForestClassifier(
        n_estimators=n_trees, random_state=42
    )
    rf.fit(X_train, y_train)
    test_acc = rf.score(X_test, y_test)
    cv_acc = cross_val_score(rf, X, y, cv=5).mean()
    print(f"{n_trees:>6d}  {test_acc:>14.3f}  {cv_acc:>14.3f}")

Performance jumps rapidly from 1 to ~50 trees. Between 50 and 100 there's usually still some improvement. Beyond 200 you're mostly paying compute for negligible gains. In practice, 100-300 trees is the sweet spot for most problems. I usually start with 100 and only increase it if cross-validation shows it's still improving.

The theoretical reason: adding trees reduces the variance of the ensemble without increasing bias. Each individual tree is a high-variance, low-bias estimator (especially if grown deep). The averaging process of the ensemble reduces variance (by the law of large numbers, roughly). The bias stays the same because each tree is still an unrestricted decision tree -- it can capture complex patterns. So the ensemble inherits the low bias of individual trees while dramatically reducing their high variance. You get the best of both worlds.

This is fundamentally different from the complexity tradeoff in single models. When you increase max_depth on a single tree, you reduce bias but increase variance. When you add more trees to a forest, you reduce variance without touching bias. That's why random forests "can't overfit by adding more trees" -- a claim that sounds wrong but is mathematically justified (as long as each tree is trained on an independent bootstrap sample with random feature subsets).

Out-of-bag evaluation: free validation

Remember that each tree misses ~37% of the training data because of bootstrap sampling? Those left-out samples are called out-of-bag (OOB) samples. Here's the clever part: for each training sample, you can collect predictions from only the trees that didn't train on it, and use those predictions as a validation estimate. No need for a separate validation set.

rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
rf_oob.fit(X_train, y_train)

print(f"OOB score:  {rf_oob.oob_score_:.3f}")
print(f"Test score: {rf_oob.score(X_test, y_test):.3f}")

# Compare to 5-fold cross-validation
cv = cross_val_score(rf_oob, X_train, y_train, cv=5)
print(f"CV score:   {cv.mean():.3f} +/- {cv.std():.3f}")

The OOB score is typically very close to the cross-validation score -- it's a legitimate estimate of generalization performance. The advantage: it comes for free during training. You don't need to re-train 5 separate models like in K-fold cross-validation. For each sample, you just check what the trees that didn't see it would predict.

This is especially valuable when your dataset is small and you can't afford to hold out a big validation set. With OOB, you use ALL your data for training and still get a reliable performance estimate. In practice, I often use OOB for quick checks and fall back to full cross-validation for the final evaluation (as we established in episode #13).

Feature importance: which inputs matter?

Random forests naturally provide a measure of how important each feature is for predictions. The idea is straightforward: features that produce large Gini impurity reductions across many trees and many split points are important; features that are rarely selected or produce small reductions are not.

Scikit-learn computes this automatically:

rf_imp = RandomForestClassifier(n_estimators=200, random_state=42)
rf_imp.fit(X_train, y_train)

importances = rf_imp.feature_importances_
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

print("\nFeature importance (Gini-based):")
for name, imp in sorted(zip(feature_names, importances),
                         key=lambda x: -x[1]):
    bar = "#" * int(imp * 50)
    print(f"  {name}: {imp:.3f} {bar}")

This connects directly to what we did in episode #15 with feature selection. Remember the three methods we covered -- correlation analysis, model-based importance, and permutation importance? The random forest's .feature_importances_ is the model-based approach, but now averaged over hundreds of trees in stead of relying on a single model's weights. That averaging makes it more stable and reliable.

A practical note: these Gini-based importances can be misleading when features have very different scales or cardinalities. High-cardinality features (many unique values) tend to get inflated importance because they offer more potential split points. For a more robust measure, you can use permutation importance -- the same technique we built from scratch in episode #15, but now applied to the forest. Scikit-learn has it built in:

from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(
    rf_imp, X_test, y_test, n_repeats=10, random_state=42
)

print("\nPermutation importance (on test set):")
for name, imp_mean, imp_std in sorted(
    zip(feature_names, perm_imp.importances_mean,
        perm_imp.importances_std),
    key=lambda x: -x[1]
):
    print(f"  {name}: {imp_mean:.3f} +/- {imp_std:.3f}")

Permutation importance shuffles one feature at a time and measures how much the model's performance drops. If shuffling a feature destroys accuracy, that feature was crucial. If shuffling it changes nothing, the feature was irrelevent. We built this from scratch in episode #15 -- now you see it applied at scale with the random forest as the underlying model ;-)

Hyperparameter tuning with pipelines

Let's bring in the full pipeline and grid search toolkit from episode #16. Even though random forests don't need feature scaling (they split on thresholds, same as the single tree from episode #17), the pipeline framework gives us clean, reproducible experiments.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Pipeline: model only (no scaling needed for trees)
pipe = Pipeline([
    ('model', RandomForestClassifier(random_state=42)),
])

# Parameter grid
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 5, 10],
    'model__max_features': ['sqrt', 'log2'],
    'model__min_samples_leaf': [1, 3, 5],
}

grid = GridSearchCV(
    pipe, param_grid, cv=5,
    scoring='f1', n_jobs=-1,
    return_train_score=True
)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV F1:      {grid.best_score_:.3f}")
print(f"Test F1:         {grid.score(X_test, y_test):.3f}")

Notice I used scoring='f1' here, same as we did for the decision tree grid search in episode #17. Remember from episode #13 -- accuracy can be misleading with imbalanced data, and F1 balances precision and recall. Also notice n_jobs=-1 on the grid search itself. That's a double parallelization: the grid search distributes different parameter combinations across CPU cores, and within each combination, the random forest distributes its trees across cores. In practice, set n_jobs on one of them (usually the grid search) and leave the other at default.

Let's look at the full evaluation:

y_pred = grid.predict(X_test)

print("\n--- Test Set Results ---\n")
print(classification_report(
    y_test, y_pred,
    target_names=['class 0', 'class 1']
))

# Show top 5 parameter combinations
print("Top 5 configurations by CV F1:")
results = list(zip(
    grid.cv_results_['mean_test_score'],
    grid.cv_results_['std_test_score'],
    grid.cv_results_['params']
))
results.sort(key=lambda x: -x[0])

for rank, (mean_s, std_s, params) in enumerate(results[:5], 1):
    depth = params.get('model__max_depth', 'None')
    feats = params['model__max_features']
    leaf = params['model__min_samples_leaf']
    trees = params['model__n_estimators']
    print(f"  #{rank}: F1={mean_s:.3f} (+/-{std_s:.3f})  "
          f"trees={trees} depth={depth} feat={feats} leaf={leaf}")

Same classification_report from episode #16. Same grid search framework. Same pipeline structure. The algorithm changed from DecisionTreeClassifier to RandomForestClassifier -- but everything around it stayed identical. That's the payoff of the consistent sklearn API. You invest once in learning the framework, and then switching models is trivial.

Random forests for regression

Everything we've covered works for regression too. In stead of majority vote, the forest averages the continuous predictions from all trees:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Generate regression data with nonlinear patterns
np.random.seed(42)
n = 400
X_reg = np.random.randn(n, 5)
y_reg = (3 * X_reg[:, 0]**2
         + 2 * X_reg[:, 1]
         - X_reg[:, 2]
         + X_reg[:, 3] * X_reg[:, 4]
         + np.random.randn(n) * 0.5)

X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Compare: single tree vs forest
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_reg.fit(X_tr_r, y_tr_r)

rf_reg = RandomForestRegressor(n_estimators=200, random_state=42)
rf_reg.fit(X_tr_r, y_tr_r)

print(f"{'Model':>25s}  {'Train R2':>10s}  {'Test R2':>10s}  {'Test RMSE':>10s}")
print("-" * 59)

for name, model in [("Single Tree (depth=5)", tree_reg),
                    ("Random Forest (200 trees)", rf_reg)]:
    tr_r2 = model.score(X_tr_r, y_tr_r)
    te_r2 = model.score(X_te_r, y_te_r)
    te_rmse = np.sqrt(mean_squared_error(
        y_te_r, model.predict(X_te_r)
    ))
    print(f"{name:>25s}  {tr_r2:>10.3f}  {te_r2:>10.3f}  {te_rmse:>10.3f}")

The random forest should comfortably beat the single tree on test data, especially on this dataset that has genuine nonlinear and interaction patterns (X[:, 0]**2 and X[:, 3] * X[:, 4]). Remember from episode #17 how single trees discover interactions automatically? The forest does the same thing, but with the stability and noise-reduction of the ensemble.

The regression forest still has the same weakness as individual regression trees: predictions are step-functions. Each tree divides the feature space into rectangular regions, and within each region, predicts the average training value. The forest averages many step functions, producing a smoother result -- but it still can't extrapolate beyond the range of the training data. If your training data has target values between 0 and 100 and a new sample comes in that should be 150, the forest will never predict 150. It's bounded by the training range. Linear regression doesn't have this limitation because it's a formula that extends infinitely in both directions. Keep this in mind when working with data that might have out-of-range values.

Comparing random forests against everything we've learned

Let's put random forests in context with a head-to-head comparisson against the models from previous episodes. Same data, same cross-validation, same metrics:

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

models = {
    "Logistic Regression": Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(max_iter=1000)),
    ]),
    "Ridge Classifier": Pipeline([
        ('scaler', StandardScaler()),
        ('model', RidgeClassifier(alpha=1.0)),
    ]),
    "Decision Tree (tuned)": DecisionTreeClassifier(
        max_depth=5, min_samples_leaf=3, random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, random_state=42
    ),
}

print(f"{'Model':>25s}  {'CV Accuracy':>12s}  {'CV F1':>10s}")
print("-" * 51)

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    f1 = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"{name:>25s}  {acc.mean():>8.3f} +/- {acc.std():.3f}  "
          f"{f1.mean():>6.3f}")

A few things to note. The linear models (Logistic Regression and Ridge Classifier) need StandardScaler in a pipeline -- we established that in episodes #11 and #16. The tree-based models don't. The random forest should be competitive with or better than the linear models on this dataset, because the data has nonlinear patterns and interactions that linear models can't capture without manual feature engineering (which we'd need to do like in episode #15).

This is one of the key selling points of random forests: they handle nonlinearity and feature interactions automatically. No polynomial features needed (episode #15). No interaction terms. No cyclical encoding for periodic features. The trees just split wherever the data says to split, and the forest makes it robust. For tabular data (spreadsheet-style, not images or text), random forests are the reliable workhorse that pretty much always delivers solid results with minimal tuning.

When random forests are all you need (and when they fall short)

After years of working with various ML models, I've come to see random forests as the "sensible default" for most tabular data problems. Here's my honest assessment:

Strengths:

Work well out of the box with minimal hyperparameter tuning
Handle both classification and regression
Capture nonlinear relationships and feature interactions automatically
No feature scaling required (remember all the time we spent on StandardScaler in episodes #11, #14, and #16?)
Naturally resistant to outliers (trees split on thresholds, so extreme values just land in a bin)
Provide feature importances for free
Rarely overfit if you use enough trees
Parallelize trivially across CPU cores
OOB score gives free validation

Weaknesses:

Predictions are step-functions -- they approximate smooth surfaces with rectangles, not curves
Cannot extrapolate beyond the training data range
Slower to predict than linear models (each sample traverses hundreds of trees)
Less interpretable than a single decision tree (though feature importance helps)
Can struggle with very high-dimensional sparse data (like text with 50,000 vocabulary terms)
Generally outperformed on competition leaderboards by gradient boosting methods

That last point is worth expanding on. In structured data competitions (Kaggle and the like), the winning solution is almost always some form of gradient boosting -- an ensemble method that builds trees sequentially in stead of independently, where each new tree focuses on correcting the mistakes of the previous ones. Random forests build all trees in parallel and aggregate; gradient boosting builds them one at a time and stacks corrections. The sequential approach typically squeezes out a few extra percentage points of performance. We'll get into exactly how and why that works in the next episode.

But here's my take: for 90% of real-world problems (not competitions), random forests give you "good enough" results with much less effort. You don't need to win a Kaggle competition to solve a business problem. A random forest with 200 trees, default hyperparameters, and minimal tuning will get you to a strong baseline in minutes. From there, you can decide whether optimizing further is worth the additional complexity.

A real-world-ish end-to-end example

Let me put everything together. We'll use our running apartment dataset (with the interaction patterns from episode #15) and build a complete random forest pipeline:

from sklearn.metrics import mean_absolute_error

# Realistic apartment data (same patterns as episodes #15 and #17)
np.random.seed(42)
n = 500

sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
age = np.random.uniform(0, 50, n)
floor = np.random.randint(0, 10, n).astype(float)
has_elevator = np.random.randint(0, 2, n).astype(float)

price = (2500 * sqm
         + 800 * rooms
         - 300 * age
         - 4000 * floor * (1 - has_elevator)
         + 20 * sqm * (50 - age) / 50
         + np.random.randn(n) * 12000)

X_apt = np.column_stack([sqm, rooms, age, floor, has_elevator])
apt_features = ["sqm", "rooms", "age", "floor", "elevator"]

X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_apt, price, test_size=0.2, random_state=42
)

# Compare: linear model WITH engineered features vs
# random forest WITHOUT engineered features
from sklearn.linear_model import Ridge

# Linear model needs the interaction feature we built in ep #15
floor_no_elev_tr = X_tr_a[:, 3] * (1 - X_tr_a[:, 4])
sqm_age_tr = X_tr_a[:, 0] * X_tr_a[:, 2]
X_tr_eng = np.column_stack([X_tr_a, floor_no_elev_tr, sqm_age_tr])

floor_no_elev_te = X_te_a[:, 3] * (1 - X_te_a[:, 4])
sqm_age_te = X_te_a[:, 0] * X_te_a[:, 2]
X_te_eng = np.column_stack([X_te_a, floor_no_elev_te, sqm_age_te])

# Scale for linear model
scaler = StandardScaler()
X_tr_eng_s = scaler.fit_transform(X_tr_eng)
X_te_eng_s = scaler.transform(X_te_eng)

ridge = Ridge(alpha=1.0)
ridge.fit(X_tr_eng_s, y_tr_a)

# Random forest: raw features, no engineering, no scaling
rf_apt = RandomForestRegressor(
    n_estimators=200, random_state=42, oob_score=True
)
rf_apt.fit(X_tr_a, y_tr_a)

print("=== Apartment Price Prediction ===\n")
print(f"{'Model':>35s}  {'Test R2':>8s}  {'Test RMSE':>12s}  "
      f"{'Test MAE':>10s}")
print("-" * 70)

for name, pred_te in [
    ("Ridge + engineered features",
     ridge.predict(X_te_eng_s)),
    ("Random Forest (raw features only)",
     rf_apt.predict(X_te_a)),
]:
    r2 = r2_score(y_te_a, pred_te)
    rmse = np.sqrt(mean_squared_error(y_te_a, pred_te))
    mae = mean_absolute_error(y_te_a, pred_te)
    print(f"{name:>35s}  {r2:>8.4f}  EUR {rmse:>8,.0f}  EUR {mae:>7,.0f}")

print(f"\nRandom Forest OOB R2: {rf_apt.oob_score_:.4f}")

Look at what just happened. The Ridge model needed us to manually engineer the floor * (1 - elevator) interaction feature and the sqm * age interaction feature, plus StandardScaler, just to capture the patterns in the data. The random forest matched or beat it using the raw features alone -- no engineering, no scaling. The trees discovered the interactions automatically by splitting on floor after splitting on elevator (which IS the interaction, as we discussed in episode #17).

That's the practical power of random forests. They reduce the feature engineering burden dramatically. You still SHOULD think about your features (domain knowledge is always valuable), but the bar for "good enough" is much lower with tree-based methods than with linear models.

Let's check what the forest learned about feature importance:

print("\nWhat the forest learned -- feature importances:")
for name, imp in sorted(
    zip(apt_features, rf_apt.feature_importances_),
    key=lambda x: -x[1]
):
    bar = "#" * int(imp * 40)
    print(f"  {name:>10s}: {imp:.3f}  {bar}")

The sqm feature should dominate (it's the strongest predictor with a coefficient of 2500 in the true relationship). The forest might rank age and floor as important too, because they both contribute to the true price formula. The elevator feature might show moderate importance because it interacts with floor -- the forest needs both to capture the joint effect.

Let's recap

We went from a single unstable tree to a forest of hundreds of diverse trees today, and the improvement was dramatic. Here's what we covered:

Ensemble methods combine many weak models into a strong one. Individual tree errors cancel through majority vote (classification) or averaging (regression) -- the same "wisdom of crowds" principle that Galton observed in 1906;
Bootstrap sampling gives each tree a different view of the data: ~63% of samples per tree (1 - 1/e), drawn with replacement. This is the "bagging" part of the algorithm;
Random feature selection at each split forces tree diversity -- the key innovation that makes random forests far more powerful than plain bagging. With sqrt(n_features) random features per split, even if one feature dominates, many trees are forced to explore alternatives;
Out-of-bag evaluation uses the ~37% of data each tree didn't see as a free validation set. No separate holdout needed, and the OOB score closely approximates cross-validation;
More trees = better, with diminishing returns beyond ~100-200. Unlike single models, adding trees reduces variance without increasing bias -- random forests genuinely cannot overfit by adding more trees;
Feature importances rank how much each input contributes to the predictions, averaged across hundreds of trees. Use permutation importance (from episode #15) for a more robust measure than the default Gini-based importances;
Random forests handle nonlinearity and interactions automatically -- the floor * elevator interaction we had to manually engineer in episode #15 for linear models, the trees discover by themselves;
The sklearn API is consistent: RandomForestClassifier and RandomForestRegressor use the same fit/predict/score pattern from episode #16, plug into the same Pipeline and GridSearchCV framework;
Random forests are the pragmatic default for tabular data: minimal tuning, no feature scaling, strong results, embarrassingly parallel. They're rarely the absolute best model on any given problem, but they're almost never bad. A strong baseline to beat.

The one area where random forests consistently lose? Competition leaderboards, where the last few percent of accuracy matter. That's where sequential ensemble methods -- which build trees that learn from each other's mistakes in stead of independently -- tend to win. That's a fundamentally different approach to combining trees, and understanding how it works requires understanding the concept of "boosting."

Bedankt voor het lezen! Questions or suggestions, drop them below ;-)

Hive account@scipio

Learn AI Series (#18) - Random Forests - Wisdom of Crowds

Learn AI Series (#18) - Random Forests - Wisdom of Crowds

What will I learn

Requirements

Difficulty

Curriculum (of the Learn AI Series):

Learn AI Series (#18) - Random Forests - Wisdom of Crowds

Why one tree isn't enough

Bootstrap aggregating (bagging)

Random feature selection: the key innovation

Random forests with scikit-learn

How many trees do you actually need?

Out-of-bag evaluation: free validation

Feature importance: which inputs matter?

Hyperparameter tuning with pipelines

Random forests for regression

Comparing random forests against everything we've learned

When random forests are all you need (and when they fall short)

A real-world-ish end-to-end example

Let's recap

Bedankt voor het lezen! Questions or suggestions, drop them below ;-)

Curriculum (of the `Learn AI Series`):