Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
What will I learn
- You will learn why scikit-learn became the standard ML library and how its API design makes everything we built from scratch trivially simple to use;
- the fit/predict/transform pattern -- the beautifully consistent interface across every model and preprocessor;
- how to swap between algorithms with one line of code and instantly compare results;
- Pipelines -- chaining preprocessing and modeling into a single leakage-proof object;
- ColumnTransformer -- handling mixed numeric and categorical data within one pipeline;
- cross-validation shortcuts that replace the 40+ lines of code we wrote ourselves;
- GridSearchCV and RandomizedSearchCV -- systematic hyperparameter tuning;
- classification with sklearn: LogisticRegression, classification_report, and confusion_matrix in one call;
- model persistence with joblib -- saving and loading entire pipelines for later use;
- common sklearn gotchas that trip up beginners (and some experienced people too).
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML (this post)
Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
At the end of episode #15 I teased something. I said there's a library that packages everything we've been building from scratch -- gradient descent, StandardScaler, cross-validation, feature engineering -- into a clean, tested, professional API. And I said it follows the exact same fit()/transform()/predict() pattern we've been implementing ourselves since episode #11. That library is scikit-learn (often imported as sklearn), and today we finally get our hands on it.
Now -- I want to be upfront about something. We could have started with sklearn in episode #10 and skipped all the from-scratch implementations. Many tutorials do exactly that. You'd learn to call model.fit(X, y) and model.predict(X_test) without ever understanding what those methods actually DO. You'd get results faster, sure. But you'd be operating a black box. When the model does something unexpected (and it will), you'd have zero intuition for why. You'd be googling error messages instead of reasoning about what went wrong.
We didn't do that. You spent six episodes building linear regression, logistic regression, evaluation metrics, data preparation, and feature engineering FROM SCRATCH. You wrote gradient descent by hand. You implemented the sigmoid function. You built your own StandardScaler class. You wrote K-fold cross-validation, fold by fold. All of that is in your muscle memory now. And the payoff is this: when you use sklearn, there are no black boxes. Every fit() call maps to something you've built yourself. Every transform() does exactly what your own functions did, just faster and more robustly tested.
That's the entire philosophy of this series. Build it first, understand it, THEN use the professional tooling ;-)
Let's go.
Installing scikit-learn
If you followed episode #2 and set up your AI workbench, you already have NumPy installed. Sklearn is one pip install away:
# In your terminal:
# pip install scikit-learn
# Verify the installation
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
# The imports you'll use most often
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import numpy as np
print("All imports successful -- you're ready to go!")
That list of imports might look intimidating, but every single one of those maps to something you've already built from scratch. StandardScaler -- you built that in episode #11. train_test_split -- you wrote that in episode #10. confusion_matrix -- episode #13. Pipeline -- the PrepPipeline class from episode #14. It's all the same stuff, just packaged professionally.
The API pattern: fit, predict, transform
Scikit-learn's greatest design achievement is its consistency. I cannot stress this enough. Every estimator (model), every transformer (preprocessor), every utility in the entire library follows the same interface pattern. Once you learn it, you know how to use ALL of sklearn. And you already know it, because you built it yourself:
- fit(X, y) -- learn from data. For a model, this means finding optimal weights. For a scaler, this means computing mean and standard deviation. Same as the fit() method on our PrepPipeline from episode #14.
- transform(X) -- apply learned parameters to data. For a scaler, this means (x - mean) / std. Same as our transform() method.
- predict(X) -- make predictions. For regression, return continuous values. For classification, return class labels. Same as our from-scratch predict() functions.
- fit_transform(X) -- convenience shortcut: fit and transform in one call. We built this too.
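And that interface is not magic -- it's just a convention you can implement yourself. Here's a minimal sketch (my own illustration; the class name is made up) of the scaler we built in episode #11, written sklearn-style. Because it follows the fit/transform pattern, it plugs straight into the rest of the library:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
class OurOwnScaler(BaseEstimator, TransformerMixin):
    """Toy re-implementation of the episode #11 scaler, sklearn-style."""
    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)   # learned parameters get a trailing underscore
        self.std_ = X.std(axis=0)
        return self                   # returning self enables method chaining
    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return (X - self.mean_) / self.std_
# TransformerMixin gives us fit_transform() for free, and this class can be
# dropped into a Pipeline just like sklearn's own StandardScaler.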
Let me show you the most direct comparison possible -- our from-scratch linear regression from episode #10 versus sklearn's version, side by side:
np.random.seed(42)
# Generate the same apartment data from episode #10
n = 200
sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
age = np.random.uniform(0, 50, n)
price = 2500 * sqm + 800 * rooms - 300 * age + np.random.randn(n) * 15000
X = np.column_stack([sqm, rooms, age])
y = price
# Split (remember episode #10: train/test before anything else)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale: fit on train, transform both
# (this is EXACTLY what we built in episode #11)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train) # fit + transform
X_test_s = scaler.transform(X_test) # transform only (same params!)
# Model: fit on train, predict on test
model = LinearRegression()
model.fit(X_train_s, y_train)
predictions = model.predict(X_test_s)
print(f"R-squared: {model.score(X_test_s, y_test):.4f}")
print(f"Coefficients: {model.coef_.round(2)}")
print(f"Intercept: {model.intercept_:.2f}")
# Compare to manual metrics
rmse = np.sqrt(mean_squared_error(y_test, predictions))
mae = mean_absolute_error(y_test, predictions)
print(f"\nRMSE: EUR {rmse:,.0f}")
print(f"MAE: EUR {mae:,.0f}")
Three lines to do what took us 50+ lines from scratch in episode #10. But -- and this is the critical point -- because you've built it from scratch, you know EXACTLY what fit() does under the hood. It's solving the least-squares problem for the optimal weights directly (sklearn's LinearRegression uses a closed-form solver rather than gradient descent -- the same idea as the normal equation we derived in episode #11). You know what transform() does -- it applies (x - mean) / std using the training statistics. You know what score() returns -- it's R-squared, the fraction of variance explained, same formula we derived in episode #11.
No black boxes. No magic. Just professional packaging around concepts you already own.
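If you ever want to convince yourself of that, the check is cheap. This snippet (my own verification, not part of the workflow above) recomputes R-squared with the episode #11 formula and compares it to what sklearn reports:
# Sanity check: model.score() and r2_score() should match the manual formula
ss_res = np.sum((y_test - predictions) ** 2)       # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)     # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(f"Manual R-squared:   {r2_manual:.4f}")
print(f"model.score():      {model.score(X_test_s, y_test):.4f}")
print(f"metrics.r2_score(): {r2_score(y_test, predictions):.4f}")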
Swapping models: one line of change
Here's where sklearn's consistent API really shines. Remember how we built linear regression in episode #10 and then added Ridge and Lasso regularization in episode #11? Each one required understanding and implementing a different math modification. With sklearn, swapping between algorithms is literally changing one line:
from sklearn.neighbors import KNeighborsRegressor
models = {
"Linear": LinearRegression(),
"Ridge": Ridge(alpha=1.0),
"Lasso": Lasso(alpha=0.1),
"KNN (k=5)": KNeighborsRegressor(n_neighbors=5),
}
print(f"{'Model':>12s} {'RMSE':>10s} {'MAE':>10s} {'R-sq':>8s}")
print("-" * 44)
for name, m in models.items():
m.fit(X_train_s, y_train)
preds = m.predict(X_test_s)
rmse = np.sqrt(mean_squared_error(y_test, preds))
mae = mean_absolute_error(y_test, preds)
r2 = m.score(X_test_s, y_test)
print(f"{name:>12s} {rmse:>10,.0f} {mae:>10,.0f} {r2:>8.4f}")
Four different algorithms. Each with one line for instantiation, one for training, one for prediction. The fit/predict interface is identical across ALL of them. That last model in the list -- KNeighborsRegressor -- is an algorithm we haven't even discussed yet. It predicts by finding the K nearest training samples and averaging their target values. No gradient descent, no weights, a completely different approach. And yet it plugs into the exact same interface. You can swap it in and out without changing any other line of code.
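To make that concrete, here's a rough sketch of the idea for a single query point -- my own illustration, not sklearn's actual implementation, which uses optimized neighbor-search structures under the hood. With sklearn's defaults (Euclidean distance, uniform weights) the two predictions should agree:
# Rough sketch of KNN regression for one query point (illustration only)
def knn_predict_one(x_query, X_train, y_train, k=5):
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]    # indices of the k closest samples
    return y_train[nearest].mean()         # average their target values
manual_pred = knn_predict_one(X_test_s[0], X_train_s, y_train, k=5)
sklearn_pred = models["KNN (k=5)"].predict(X_test_s[:1])[0]
print(f"Manual KNN prediction:  {manual_pred:,.0f}")
print(f"sklearn KNN prediction: {sklearn_pred:,.0f}")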
This is why sklearn dominates ML in Python. It lets you focus on the problem -- which algorithm works best, which features matter, which hyperparameters to tune -- instead of reimplementing each algorithm from scratch every time you want to try something new.
Pipelines: chaining operations safely
Remember the data leakage problem from episode #14? The cardinal sin? Scaling before splitting, imputing from the full dataset, any situation where test information contaminates training? We built a PrepPipeline class to prevent it. Well, sklearn's Pipeline does the same thing, but more powerfully.
A Pipeline chains preprocessing steps and a final model into one object. When you call fit() on the pipeline, it calls fit_transform() on each preprocessing step (using only training data) and then fit() on the final model. When you call predict(), it calls transform() on each step and then predict() on the model. Leakage is impossible by design:
from sklearn.preprocessing import PolynomialFeatures
# Without pipeline: manual steps, leakage risk if you're not careful
# With pipeline: automatic, safe, clean
pipe = Pipeline([
('scaler', StandardScaler()),
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('model', Ridge(alpha=1.0)),
])
# fit() calls fit_transform on each step (using only training data)
pipe.fit(X_train, y_train)
# predict() calls transform on each step, then predict on the last
predictions = pipe.predict(X_test)
print(f"Pipeline R-squared: {pipe.score(X_test, y_test):.4f}")
That's it. Three preprocessing/modeling steps in one object. Scale the features, create polynomial features (remember episode #15 -- x^2 terms, interaction terms), then fit Ridge regression. The pipeline ensures that scaling parameters and polynomial feature creation are fitted on training data only. During cross-validation, each fold gets its own scaler fit, its own polynomial fit. No leakage, by construction.
Compare that to what we had to do manually in episode #14 -- carefully tracking which statistics came from which set, making sure we never accidentally called fit() on the test data. The pipeline eliminates that entire class of bugs.
Having said that, you need to understand what the pipeline is doing under the hood, which is why we built it manually first. If a pipeline gives unexpected results, you need to be able to reason about which step is causing the problem. That reasoning comes from having built it yourself.
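One practical way to do that reasoning is to look inside the fitted pipeline. named_steps exposes each fitted step, with the same attributes we know from having built these objects ourselves (here I'm just peeking at the pipeline fitted above):
# Peek inside the fitted pipeline: each step is a fitted object you can inspect
fitted_scaler = pipe.named_steps['scaler']
fitted_model = pipe.named_steps['model']
print(f"Scaler means: {fitted_scaler.mean_.round(2)}")
print(f"Scaler stds:  {fitted_scaler.scale_.round(2)}")
print(f"Ridge learned {fitted_model.coef_.shape[0]} coefficients "
      f"(one per polynomial feature)")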
ColumnTransformer: mixed data types in one pipeline
Real datasets mix numeric and categorical columns. Episode #14 taught you the difference -- numeric features get scaled, categorical features get one-hot encoded, and applying the wrong transformation to the wrong column type produces garbage. ColumnTransformer handles this cleanly by applying different preprocessing to different columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Simulated mixed data: apartments with numeric + categorical features
np.random.seed(42)
n = 300
sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n)
neighborhoods = np.random.choice(['centrum', 'zuid', 'west', 'oost', 'noord'], n)
age = np.random.uniform(0, 40, n)
price = 2500 * sqm + 500 * rooms - 300 * age + np.random.randn(n) * 10000
# Combine into one 2D array (sklearn works with 2D arrays)
# We need object dtype so the float columns and the string column can coexist
X_mixed = np.empty((n, 4), dtype=object)
X_mixed[:, 0], X_mixed[:, 1], X_mixed[:, 2], X_mixed[:, 3] = sqm, rooms, neighborhoods, age
# Define which columns get which preprocessing
numeric_cols = [0, 1, 3] # sqm, rooms, age -- scale these
categorical_cols = [2] # neighborhood -- one-hot encode this
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_cols),
('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
categorical_cols),
])
# Full pipeline: preprocessing + model
full_pipe = Pipeline([
('preprocess', preprocessor),
('model', Ridge(alpha=1.0)),
])
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
X_mixed, price, test_size=0.2, random_state=42
)
full_pipe.fit(X_train_m, y_train_m)
r2 = full_pipe.score(X_test_m, y_test_m)
print(f"Mixed-type pipeline R-squared: {r2:.4f}")
# Peek at what the preprocessor created
X_transformed = preprocessor.fit_transform(X_train_m)
print(f"\nInput features: {X_mixed.shape[1]} columns (3 numeric + 1 categorical)")
print(f"After preprocessing: {X_transformed.shape[1]} columns "
f"(3 scaled numeric + {X_transformed.shape[1] - 3} one-hot)")
Numeric columns get StandardScaler. The categorical neighborhood column gets OneHotEncoder -- which does exactly what we built manually in episode #14 (binary column per category, no implied ordering). Everything happens inside the pipeline, so cross-validation works correctly. Each fold computes its own scaling statistics and its own one-hot categories from the training portion only.
The handle_unknown='ignore' parameter is a practical detail worth noting. If the test set contains a neighborhood that wasn't in the training set, the encoder sets all the one-hot columns to zero for that sample instead of crashing. In real production systems, you WILL encounter previously unseen categories, so this is a smart default.
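To see that behavior with your own eyes, here's a tiny check. The neighborhood name 'haven' is made up and never appears in the training data:
# Transform a sample with a neighborhood the encoder has never seen:
# the one-hot part of the row comes back as all zeros instead of an error.
unseen = np.empty((1, 4), dtype=object)
unseen[0] = [80.0, 3, 'haven', 10.0]   # sqm, rooms, unseen neighborhood, age
row = preprocessor.transform(unseen)[0]
print(f"Scaled numeric part:  {row[:3].round(2)}")
print(f"One-hot part (zeros): {row[3:]}")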
Cross-validation shortcuts
Remember the 40+ lines of code we wrote for K-fold cross-validation in episode #13? The fold splitting, the per-fold scaling, the metric collection, the mean and standard deviation computation? Sklearn does it in one line:
# Quick 5-fold cross-validation -- one line!
scores = cross_val_score(
Ridge(alpha=1.0), X_train_s, y_train,
cv=5, scoring='r2'
)
print(f"5-Fold CV R-squared scores: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f} +/- {scores.std():.3f}")
One line. The same result as our 40-line from-scratch implementation in episode #13, but with all the edge cases handled. Each fold gets its own independent train/validation split. If you pass a pipeline instead of a raw model, the pipeline's preprocessing is re-fitted on each fold's training data. No leakage. Stratified splitting (which we implemented manually in episode #13) is the default for classification tasks.
You can use cross_val_score with any model, any scoring metric, any number of folds. Try different scoring values: 'r2', 'neg_mean_squared_error', 'neg_mean_absolute_error' for regression; 'accuracy', 'f1', 'precision', 'recall', 'roc_auc' for classification. The neg_ prefix on regression metrics is because sklearn maximizes scores internally, and lower MSE is better, so it negates it. A minor convention thing -- just remember to negate the results back when reading them.
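If you forget the exact scoring string, recent sklearn versions (1.0 and later) can list every built-in scorer name for you:
# List the available scoring strings (sklearn 1.0+)
from sklearn.metrics import get_scorer_names
scorers = get_scorer_names()
print(f"{len(scorers)} built-in scorers, for example: {scorers[:5]}")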
# Cross-validation with a Pipeline (leakage-proof by design)
pipe_cv = Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0)),
])
scores_pipe = cross_val_score(pipe_cv, X_train, y_train, cv=5, scoring='r2')
print(f"\nPipeline CV R-squared: {scores_pipe.mean():.3f} +/- {scores_pipe.std():.3f}")
# Multiple metrics at once
from sklearn.model_selection import cross_validate
results = cross_validate(
pipe_cv, X_train, y_train, cv=5,
scoring=['r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'],
return_train_score=True
)
print(f"\nMulti-metric CV results:")
print(f" Test R-sq: {results['test_r2'].mean():.3f} +/- {results['test_r2'].std():.3f}")
print(f" Test RMSE: {np.sqrt(-results['test_neg_mean_squared_error'].mean()):,.0f}")
print(f" Test MAE: {-results['test_neg_mean_absolute_error'].mean():,.0f}")
print(f" Train R-sq: {results['train_r2'].mean():.3f} (sanity check)")
That return_train_score=True is a nice touch -- it lets you check if the model is overfitting (big gap between train and test scores) or underfitting (both scores are low). We discussed this tradeoff back in episode #11 when we swept polynomial degrees and watched the training error drop while test error climbed. Same concept, one parameter flag.
GridSearchCV: finding the best hyperparameters
Hyperparameters are the settings you choose BEFORE training -- regularization strength (alpha), polynomial degree, number of neighbors in KNN. They can't be learned from data directly; you have to try different values and see which works best. In episode #11 we did this manually by looping over alpha values and tracking the best. GridSearchCV automates this completely:
pipe_grid = Pipeline([
('scaler', StandardScaler()),
('model', Ridge()),
])
# Define the parameter grid
# Note the syntax: step_name__parameter_name
param_grid = {
'model__alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
}
# Grid search with 5-fold cross-validation
search = GridSearchCV(
pipe_grid, param_grid,
cv=5, scoring='r2',
return_train_score=True
)
search.fit(X_train, y_train)
print(f"Best alpha: {search.best_params_['model__alpha']}")
print(f"Best CV R-squared: {search.best_score_:.4f}")
print(f"Test R-squared: {search.score(X_test, y_test):.4f}")
# Show all results
print(f"\nAll combinations tested:")
for mean_score, std_score, params in zip(
search.cv_results_['mean_test_score'],
search.cv_results_['std_test_score'],
search.cv_results_['params']
):
print(f" alpha={params['model__alpha']:>7.3f} "
f"CV R-sq={mean_score:.4f} (+/- {std_score:.4f})")
GridSearchCV tries every combination in param_grid, evaluates each with K-fold cross-validation, and keeps the best. The final model is automatically refit on the full training set with the winning hyperparameters. When you call search.predict() or search.score(), you're using that best model.
Notice the step_name__parameter_name syntax with double underscores. That's how you target a specific step inside a pipeline. 'model__alpha' means "the alpha parameter of the step named model". You can tune parameters of any step this way -- including preprocessor parameters.
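The same double-underscore addressing is what get_params() and set_params() use on any pipeline, which is handy for setting nested parameters by hand (a small illustration):
# Set a nested parameter directly on the pipeline using the same addressing
pipe_grid.set_params(model__alpha=5.0)
print(pipe_grid.get_params()['model__alpha'])   # -> 5.0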
For larger grids, there's RandomizedSearchCV, which samples a fixed number of random combinations instead of trying everything. If you have 5 parameters each with 10 values, the full grid is 100,000 combinations. That's going to take a while. RandomizedSearchCV might try 100 random combinations and find something 95% as good in a fraction of the time:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
# Randomized search with distributions instead of lists
param_distributions = {
'model__alpha': uniform(loc=0.001, scale=100), # uniform between 0.001 and 100.001
}
random_search = RandomizedSearchCV(
pipe_grid, param_distributions,
n_iter=20, # try 20 random combinations
cv=5, scoring='r2',
random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best alpha (random search): {random_search.best_params_['model__alpha']:.4f}")
print(f"Best CV R-squared: {random_search.best_score_:.4f}")
The beauty of both approaches: they use cross-validation internally, so you don't need a separate validation set. The grid search IS your validation process. You still keep the test set untouched for the final evaluation -- same principle from episode #14.
Classification with sklearn
Everything we just covered works identically for classification. Same API, same patterns, same pipeline structure. Let me show you the full workflow -- from raw data to evaluation report -- using sklearn's classification tools:
# Medical screening dataset (callback to episode #13)
np.random.seed(42)
n = 500
biomarkers = np.random.randn(n, 5)
risk = (0.8 * biomarkers[:, 0] + 0.5 * biomarkers[:, 1]
- 0.3 * biomarkers[:, 2] + 0.2 * biomarkers[:, 3]
+ np.random.randn(n) * 0.8)
y_class = (risk > 0.5).astype(int)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
biomarkers, y_class, test_size=0.2, random_state=42, stratify=y_class
)
print(f"Dataset: {n} patients, 5 biomarkers")
print(f" Train: {len(y_train_c)} ({y_train_c.mean():.1%} high risk)")
print(f" Test: {len(y_test_c)} ({y_test_c.mean():.1%} high risk)")
Notice the stratify=y_class parameter in train_test_split. This is the stratified splitting we implemented by hand in episode #13 -- it ensures the class balance is preserved in both train and test sets. One parameter instead of 15 lines of code.
Now let's train and evaluate:
# Build a classification pipeline
clf_pipe = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000)),
])
clf_pipe.fit(X_train_c, y_train_c)
y_pred_c = clf_pipe.predict(X_test_c)
# The full evaluation report -- one call!
print("\n--- Classification Report ---\n")
print(classification_report(y_test_c, y_pred_c,
target_names=['low risk', 'high risk']))
# Confusion matrix
cm = confusion_matrix(y_test_c, y_pred_c)
print(f"Confusion Matrix:")
print(f" Predicted")
print(f" Low High")
print(f" Actual Low {cm[0,0]:>4d} {cm[0,1]:>4d}")
print(f" Actual High {cm[1,0]:>4d} {cm[1,1]:>4d}")
Look at that classification_report output. Precision, recall, F1 for each class, support counts, weighted averages -- everything we computed manually in episode #13, in one call. The confusion matrix gives you the exact same four numbers (TP, FP, FN, TN) we built by hand.
Let's also grab the probabilities and compute AUC-ROC -- remember from episode #13, we need the raw scores, not just the class labels, to evaluate across all thresholds:
from sklearn.metrics import roc_auc_score
# predict_proba gives probabilities instead of hard class labels
y_proba_c = clf_pipe.predict_proba(X_test_c)[:, 1] # probability of class 1
auc = roc_auc_score(y_test_c, y_proba_c)
print(f"\nAUC-ROC: {auc:.4f}")
# Cross-validated classification with multiple metrics
from sklearn.model_selection import cross_validate
clf_cv_results = cross_validate(
clf_pipe, biomarkers, y_class,
cv=5, scoring=['accuracy', 'f1', 'roc_auc'],
)
print(f"\n5-Fold CV Results:")
print(f" Accuracy: {clf_cv_results['test_accuracy'].mean():.3f} "
f"+/- {clf_cv_results['test_accuracy'].std():.3f}")
print(f" F1: {clf_cv_results['test_f1'].mean():.3f} "
f"+/- {clf_cv_results['test_f1'].std():.3f}")
print(f" AUC-ROC: {clf_cv_results['test_roc_auc'].mean():.3f} "
f"+/- {clf_cv_results['test_roc_auc'].std():.3f}")
Every metric from episode #13 -- accuracy, F1, AUC-ROC -- available as a one-liner with cross-validation built in. And because clf_pipe includes StandardScaler, the scaling is re-fitted inside each fold. Zero leakage.
Model persistence: save and load
After training a model (or an entire pipeline), you want to save it to disk so you don't have to retrain every time. Imagine you've trained a model, tuned the hyperparameters, validated the performance -- you don't want to repeat all that work every time your script runs. joblib handles this:
import joblib
# Train a complete pipeline
final_pipe = Pipeline([
('scaler', StandardScaler()),
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('model', Ridge(alpha=1.0)),
])
final_pipe.fit(X_train, y_train)
print(f"Model trained. Test R-sq: {final_pipe.score(X_test, y_test):.4f}")
# Save the ENTIRE pipeline -- scaler params, poly config, model weights
joblib.dump(final_pipe, '/tmp/apartment_model.joblib')
print(f"Model saved to /tmp/apartment_model.joblib")
# Load it back (imagine this is a completely different script, days later)
loaded_pipe = joblib.load('/tmp/apartment_model.joblib')
loaded_preds = loaded_pipe.predict(X_test)
print(f"Loaded model R-sq: {loaded_pipe.score(X_test, y_test):.4f}")
print(f"Predictions match: {np.allclose(final_pipe.predict(X_test), loaded_preds)}")
The saved file contains the ENTIRE pipeline -- the scaler's mean and standard deviation, the polynomial feature structure, the Ridge model's weights and intercept. Load it anywhere and predict immediately. No retraining, no recomputing scaling statistics. This is how models move from "I trained something cool on my laptop" to "this is deployed in production."
One important note: the saved model is tied to the specific versions of sklearn and NumPy you used. If you train with sklearn 1.3 and try to load with sklearn 1.5, it might work or it might not. For production deployments, pin your package versions. For learning purposes, this is a detail you can worry about later.
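One lightweight habit (my own convention, not something sklearn requires) is to store the library versions next to the pipeline, so a future you can see what the model was trained with:
import joblib
import numpy as np
import sklearn
# Bundle the pipeline together with the versions it was trained under
bundle = {
    "pipeline": final_pipe,
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
}
joblib.dump(bundle, '/tmp/apartment_model_bundle.joblib')
loaded = joblib.load('/tmp/apartment_model_bundle.joblib')
print(f"Trained with sklearn {loaded['sklearn_version']}, "
      f"NumPy {loaded['numpy_version']}")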
Common sklearn gotchas
After using sklearn for quite some time, I've collected a few gotchas that trip up beginners (and honestly, some experienced people too). Let me save you the debugging time:
# Gotcha 1: fit() OVERWRITES previous training
model_1 = LinearRegression()
model_1.fit(X_train_s[:50], y_train[:50]) # trained on 50 samples
r2_first = model_1.score(X_test_s, y_test)
model_1.fit(X_train_s[:100], y_train[:100]) # OVERWRITES! Now trained on 100
r2_second = model_1.score(X_test_s, y_test)
print(f"Gotcha 1: fit() overwrites previous training")
print(f" After first fit (50 samples): R-sq = {r2_first:.4f}")
print(f" After second fit (100 samples): R-sq = {r2_second:.4f}")
print(f" (The first model is gone -- fit() mutates in place)\n")
That's a subtle one. If you want to compare models trained on different data, you need separate model objects. Calling fit() again doesn't create a new model -- it replaces the old one.
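If what you actually want is a fresh, unfitted copy with the same hyperparameters, sklearn.base.clone gives you one, so the two trainings live in separate objects:
# clone() returns an unfitted copy with identical hyperparameters
from sklearn.base import clone
model_a = LinearRegression()
model_b = clone(model_a)                  # same settings, separate object
model_a.fit(X_train_s[:50], y_train[:50])
model_b.fit(X_train_s[:100], y_train[:100])
print(f"50-sample model:  R-sq = {model_a.score(X_test_s, y_test):.4f}")
print(f"100-sample model: R-sq = {model_b.score(X_test_s, y_test):.4f}")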
# Gotcha 2: random_state for reproducibility
from sklearn.ensemble import RandomForestRegressor
# Deterministic models (like LinearRegression) give same results every time
m_det = LinearRegression()
m_det.fit(X_train_s, y_train)
print(f"Gotcha 2: some models are stochastic")
print(f" LinearRegression is deterministic -- no random_state needed")
# Stochastic models (like RandomForest) give different results each time!
m_rand_a = RandomForestRegressor(n_estimators=10)
m_rand_a.fit(X_train_s, y_train)
r2_a = m_rand_a.score(X_test_s, y_test)
m_rand_b = RandomForestRegressor(n_estimators=10)
m_rand_b.fit(X_train_s, y_train)
r2_b = m_rand_b.score(X_test_s, y_test)
m_rand_c = RandomForestRegressor(n_estimators=10, random_state=42)
m_rand_c.fit(X_train_s, y_train)
r2_c = m_rand_c.score(X_test_s, y_test)
m_rand_d = RandomForestRegressor(n_estimators=10, random_state=42)
m_rand_d.fit(X_train_s, y_train)
r2_d = m_rand_d.score(X_test_s, y_test)
print(f" RandomForest without seed: {r2_a:.4f} vs {r2_b:.4f} (different!)")
print(f" RandomForest WITH seed=42: {r2_c:.4f} vs {r2_d:.4f} (identical)\n")
We haven't covered Random Forests yet (that's coming up in this series), but the point is general: any model with randomness in its training process needs random_state to be set if you want reproducible results. train_test_split has the same issue -- always pass random_state unless you intentionally want different splits each time.
# Gotcha 3: scoring parameter sign convention
from sklearn.model_selection import cross_val_score
# For regression, "higher is better" metrics work intuitively
r2_scores = cross_val_score(Ridge(alpha=1.0), X_train_s, y_train,
cv=5, scoring='r2')
print(f"Gotcha 3: scoring sign conventions")
print(f" R-squared (positive = good): {r2_scores.mean():.4f}")
# But MSE is "lower is better" -- sklearn NEGATES it!
mse_scores = cross_val_score(Ridge(alpha=1.0), X_train_s, y_train,
cv=5, scoring='neg_mean_squared_error')
print(f" neg_MSE (NEGATIVE values): {mse_scores.mean():.0f}")
print(f" Actual MSE (negate it back): {-mse_scores.mean():.0f}")
print(f" RMSE: {np.sqrt(-mse_scores.mean()):,.0f}\n")
This one catches EVERYONE at some point. You run cross_val_score with scoring='neg_mean_squared_error' and get large negative numbers. That's not a bug -- sklearn negates "lower is better" metrics so that its internal optimization (which always maximizes) works correctly. Just negate the result when reading it. Annoying convention, but once you know it, it's fine.
# Gotcha 4: feature count must match between fit() and predict()
scaler_gotcha = StandardScaler()
scaler_gotcha.fit(X_train_s) # fitted on data with 3 features
try:
# Trying to transform data with 5 features -- CRASH!
bad_data = np.random.randn(10, 5)
scaler_gotcha.transform(bad_data)
except ValueError as e:
print(f"Gotcha 4: feature count mismatch")
print(f" Error: {e}")
print(f" (Fitted on 3 features, tried to transform 5 -- sklearn catches this)")
This is sklearn protecting you from yourself. If you fit a scaler on 3 columns and then try to transform data with 5 columns, the mean and std arrays have the wrong shape. The error message is usually clear about what happened. This is one reason pipelines are so useful -- they handle the plumbing and ensure consistent feature shapes throughout.
What we built versus what sklearn gives us
Let me show you the complete mapping. Every single sklearn tool we used today corresponds to something we built from scratch in episodes #10-15:
| What we built (from scratch) | Sklearn equivalent | Episode |
|---|---|---|
| Manual gradient descent | LinearRegression().fit() | #10 |
| Normal equation solver | LinearRegression().fit() (default) | #11 |
| Ridge regression with L2 penalty | Ridge(alpha=...) | #11 |
| Lasso regression with L1 penalty | Lasso(alpha=...) | #11 |
| StandardScaler class | sklearn.preprocessing.StandardScaler | #11 |
| Logistic regression from scratch | LogisticRegression() | #12 |
| confusion_matrix() function | sklearn.metrics.confusion_matrix | #13 |
| Precision, recall, F1 functions | classification_report() | #13 |
| AUC-ROC computation | roc_auc_score() | #13 |
| K-fold cross-validation loop | cross_val_score() | #13 |
| Stratified K-fold | StratifiedKFold / stratify= param | #13 |
| Manual train/test split | train_test_split() | #10, #14 |
| PrepPipeline class (fit/transform) | Pipeline | #14 |
| One-hot encoding loop | OneHotEncoder | #14 |
| Polynomial features function | PolynomialFeatures | #15 |
| Grid search loop over alphas | GridSearchCV | #11 |
Same concepts. Same math. Same patterns. Professional implementation with edge case handling, documentation, and years of battle-testing by thousands of contributors. From here on, we'll use sklearn for standard operations and only go from-scratch when we need to understand something new that hasn't been covered yet.
A complete real-world workflow
Let me bring everything together in one realistic example. This is close to what a real ML project looks like with sklearn -- the kind of workflow you'd use at work or in a data science competition:
# A complete sklearn workflow: apartment price prediction
np.random.seed(42)
n = 500
# Generate realistic-ish apartment data
sqm = np.random.uniform(30, 150, n)
rooms = np.random.randint(1, 6, n).astype(float)
neighborhoods = np.random.choice(
['centrum', 'zuid', 'west', 'oost', 'noord'], n
)
age = np.random.uniform(0, 50, n)
floor = np.random.randint(0, 10, n).astype(float)
has_elevator = np.random.randint(0, 2, n).astype(float)
price = (2500 * sqm + 500 * rooms - 300 * age
- 4000 * floor * (1 - has_elevator)
+ np.random.randn(n) * 12000)
# Object dtype so numeric and string columns can live in one array
X_all = np.empty((n, 6), dtype=object)
X_all[:, 0], X_all[:, 1], X_all[:, 2] = sqm, rooms, neighborhoods
X_all[:, 3], X_all[:, 4], X_all[:, 5] = age, floor, has_elevator
# Step 1: split FIRST (golden rule from episode #14)
X_tr, X_te, y_tr, y_te = train_test_split(
X_all, price, test_size=0.2, random_state=42
)
# Step 2: build the preprocessing + model pipeline
numeric_features = [0, 1, 3, 4, 5] # sqm, rooms, age, floor, elevator
cat_features = [2] # neighborhood
preprocessor = ColumnTransformer([
('num', Pipeline([
('scaler', StandardScaler()),
('poly', PolynomialFeatures(degree=2, include_bias=False)),
]), numeric_features),
('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
cat_features),
])
full_pipeline = Pipeline([
('preprocess', preprocessor),
('model', Ridge()),
])
# Step 3: hyperparameter tuning with cross-validation
param_grid = {
'preprocess__num__poly__degree': [1, 2],
'model__alpha': [0.01, 0.1, 1.0, 10.0],
}
grid = GridSearchCV(
full_pipeline, param_grid,
cv=5, scoring='r2',
return_train_score=True
)
grid.fit(X_tr, y_tr)
# Step 4: results
print("=== Apartment Price Prediction ===\n")
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV R-squared: {grid.best_score_:.4f}")
print(f"Test R-squared: {grid.score(X_te, y_te):.4f}")
test_preds = grid.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, test_preds))
mae = mean_absolute_error(y_te, test_preds)
print(f"Test RMSE: EUR {rmse:,.0f}")
print(f"Test MAE: EUR {mae:,.0f}")
# Show top 5 results from the grid
print(f"\nTop 5 parameter combinations:")
results_df = list(zip(
grid.cv_results_['mean_test_score'],
grid.cv_results_['std_test_score'],
grid.cv_results_['params']
))
results_df.sort(key=lambda x: -x[0])
for rank, (mean_s, std_s, params) in enumerate(results_df[:5], 1):
print(f" #{rank}: R-sq={mean_s:.4f} (+/- {std_s:.4f}) {params}")
Look at what that code does. Mixed numeric and categorical data, polynomial feature expansion, scaling, one-hot encoding, Ridge regression with hyperparameter tuning, cross-validation -- all in one clean pipeline. The GridSearchCV tries every combination of polynomial degree and regularization strength, evaluates each with 5-fold CV, and picks the best. The test set is only touched once at the end.
This is the workflow. Data in, pipeline, grid search, evaluate. Once you have this pattern down, you can apply it to any supervised learning problem. Swap in different models, add or remove preprocessing steps, change the parameter grid. The structure stays the same.
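For example, swapping the final model for Lasso is a two-line change. A quick sketch -- the alpha values here are just illustrative, not tuned:
# Same pipeline structure, different final model
from sklearn.base import clone
from sklearn.linear_model import Lasso
lasso_pipeline = clone(full_pipeline).set_params(model=Lasso(max_iter=10000))
lasso_grid = GridSearchCV(
    lasso_pipeline,
    {'model__alpha': [1.0, 10.0, 100.0],
     'preprocess__num__poly__degree': [1, 2]},
    cv=5, scoring='r2',
)
lasso_grid.fit(X_tr, y_tr)
print(f"Lasso best params:    {lasso_grid.best_params_}")
print(f"Lasso test R-squared: {lasso_grid.score(X_te, y_te):.4f}")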
The shift: from building to using
Let me step back for a second and reflect on where we are in this series. Episodes #1-9 were about building intuition and understanding -- what ML is, how data works, what the math does. Episodes #10-15 were about building everything from scratch -- models, evaluation, data prep, feature engineering. Every function, every loop, every formula written by hand so you know exactly what happens when.
Now with episode #16, we've made the shift to using professional tools. And here's the thing -- this shift is NOT about abandoning what we built. It's about standing on it. You can read StandardScaler().fit_transform(X) and instantly know it's computing (X - mean) / std using the training statistics. You can see cross_val_score(model, X, y, cv=5) and know it's doing the exact K-fold loop we wrote in episode #13. You can debug a pipeline that gives weird results because you understand each step's internals.
From here, the series moves into new territory. We've covered linear models thoroughly -- regression and classification, from scratch and with sklearn. The next part of the journey takes us into models that think very differently from linear regression. Models that make decisions by asking a series of yes/no questions about your data. Models that build collections of simpler models and combine their answers. Models that can discover complex nonlinear patterns without you having to engineer polynomial and interaction features by hand. The tools we set up today -- pipelines, cross-validation, grid search -- will carry us through all of them.
Let's recap
We traded our hand-built tools for professional-grade versions today, and the transition was smooth because we understood every tool we replaced. Here's what we covered:
- Scikit-learn's API is beautifully consistent: fit() learns from data, predict() makes predictions, transform() preprocesses, fit_transform() does both. The same pattern on every model and preprocessor -- learn it once, use it everywhere;
- Swapping models is trivial thanks to the consistent API. Linear, Ridge, Lasso, KNN -- same two lines of code, just change the class name. This lets you focus on which algorithm works best, not on reimplementation;
- Pipelines chain preprocessing and modeling into one leakage-proof object. Scaling, polynomial features, one-hot encoding, and the final model all fitted correctly during cross-validation. The PrepPipeline concept from episode #14, done right;
- ColumnTransformer handles mixed data types within a pipeline -- numeric columns get scaled, categorical columns get one-hot encoded, all automatically. handle_unknown='ignore' prevents crashes on unseen categories;
- Cross-validation in one line: cross_val_score() replaces our 40-line K-fold implementation from episode #13. cross_validate() gives you multiple metrics at once;
- GridSearchCV automates hyperparameter tuning by trying every combination with cross-validation. RandomizedSearchCV does the same for large search spaces. Both use the pipeline, so preprocessing is correct inside each fold;
- Classification works identically: LogisticRegression, classification_report, confusion_matrix, roc_auc_score -- everything from episode #13 in professional form. stratify= in train_test_split handles stratified splitting;
- joblib saves entire pipelines (scaler + features + model) to disk for later use. Load it anywhere and predict immediately;
- Every sklearn tool maps directly to something you built from scratch in episodes #10-15. There are no black boxes -- you know what's under the hood.