Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
What will I learn
- You will learn how to apply everything from episodes #1-20 to a real, messy, end-to-end classification problem;
- collecting and preparing historical price and volume data with pandas;
- engineering features for financial time series -- returns, volatility, moving averages, momentum indicators;
- why random train/test splits will LIE to you on time series data and how walk-forward validation fixes it;
- comparing logistic regression, decision trees, random forests, gradient boosting, and SVMs head-to-head on the same problem;
- inspecting feature importances to check whether the model learned something sensible;
- interpreting modest results honestly and resisting the temptation to inflate numbers through leaky validation;
- building a complete prediction pipeline from raw data to evaluated, compared, and understood models.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes (this post)
Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
Twenty episodes in. We've built linear regression from scratch (episode #10), logistic regression from scratch (#12), decision trees from scratch (#17), AdaBoost from scratch (#19), and we've used scikit-learn to wrap all of them into clean, reproducible pipelines (#16). We've talked about evaluation (#13), data preparation (#14), feature engineering (#15), random forests (#18), gradient boosting (#19), and SVMs (#20). That's a LOT of concepts, a lot of code, and a lot of isolated examples.
Time to put it ALL together.
This episode is different from everything that came before. Less teaching. More doing. We're building a complete ML pipeline from raw data to evaluated model, using a real-world problem that's genuinely hard: predicting crypto market regimes. The goal isn't to build a money-printing trading bot (spoiler: any tutorial claiming to teach you that is lying). The goal is to experience the full workflow -- data preparation, feature engineering, model selection, validation, interpretation -- on a problem where the noise is real, the data has traps, and honest results look modest.
Let's dive right in.
The problem: what are we predicting?
"Market regime" sounds fancy, but we need a concrete, computable definition before writing any code. You can't feed "vibes" into model.fit() ;-)
We'll define three regimes based on the 14-day forward return:
- Bullish: the price 14 days from now is more than 5% higher than today
- Bearish: the price 14 days from now is more than 5% lower than today
- Ranging: everything in between (-5% to +5%)
This gives us a three-class classification problem. Remember episode #12 where we built logistic regression for binary classification (two classes)? Same idea, but with three possible outcomes. Scikit-learn handles multi-class naturally (remember classification_report from episode #13 printing precision and recall per class?).
The 5% threshold and 14-day horizon are choices. In a production setting you'd experiment with different thresholds and horizons. For this mini project, these values produce reasonably balanced classes across most market histories -- not perfectly balanced, but not wildly skewed either. Class imbalance is something we discussed in episode #13 when we talked about why accuracy alone can be misleading, and it matters here too.
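To get a feel for how the threshold shapes class balance, here's a quick sketch. The forward returns below are synthetic stand-ins drawn from an assumed distribution, not output from the pipeline we're about to build:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for 14-day forward returns (illustration only)
fwd = rng.normal(0.01, 0.08, 1000)

balance = {}
for thr in [0.03, 0.05, 0.08]:
    # 0 = bearish, 1 = ranging, 2 = bullish
    labels = np.where(fwd > thr, 2, np.where(fwd < -thr, 0, 1))
    balance[thr] = np.bincount(labels, minlength=3)
    bear, mid, bull = balance[thr]
    print(f"+/-{thr:.0%}: bearish={bear}, ranging={mid}, bullish={bull}")
```

A wider band pushes more days into "ranging"; a narrower one fragments trends into noise. Sweeping a few values like this is how you'd sanity-check balance before committing to a threshold.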
Getting the data
We'll work with simulated daily price and volume data that captures the key statistical properties of real crypto markets. In a real project you'd pull OHLCV (Open, High, Low, Close, Volume) from a public exchange API -- CoinGecko, Binance, whatever -- but simulated data gives us reproducibility AND lets us inject known regime changes so we can verify the pipeline is working.
import numpy as np
import pandas as pd
np.random.seed(42)
n_days = 1095 # 3 years of daily data
# Daily log returns: small positive drift, high volatility
returns = np.random.normal(0.0005, 0.03, n_days)
# Inject regime changes -- distinct periods where the market
# character shifts. This is what makes it interesting.
for start, shift in [(100, 0.005), (300, -0.004), (500, 0.006),
                     (700, -0.005), (900, 0.004)]:
    returns[start:start + 100] += shift
# Convert log returns to a price series starting at 40000
price = 40000 * np.exp(np.cumsum(returns))
# Volume correlates with absolute returns (volatility drives activity)
volume = np.abs(np.random.normal(1e9, 3e8, n_days))
volume *= (1 + np.abs(returns) * 10)
# High/Low spread around close
high = price * (1 + np.abs(np.random.normal(0, 0.015, n_days)))
low = price * (1 - np.abs(np.random.normal(0, 0.015, n_days)))
df = pd.DataFrame({
    'date': pd.date_range('2022-01-01', periods=n_days, freq='D'),
    'close': price,
    'high': high,
    'low': low,
    'volume': volume,
}).set_index('date')
print(f"Data shape: {df.shape}")
print(f"Date range: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Price range: {df['close'].min():,.0f} to {df['close'].max():,.0f}")
Why simulate instead of using real data? A few reasons. First, reproducibility -- everyone reading this gets exactly the same numbers. Second, we get to inject known regime changes, so we can verify the pipeline actually detects the thing we built it to detect. Real financial data adds further challenges -- fat-tailed returns (extreme moves happen more often than a normal distribution predicts) and volatility clustering (calm periods followed by turbulent ones) -- that our simple Gaussian simulation only hints at, but the regime structure it does contain is exactly the pattern our features and models need to find. Our pipeline gets tested on realistic patterns without needing an API key or worrying about stale data.
Feature engineering: turning raw prices into useful signals
This is where episode #15 really pays off. Raw OHLCV data is NOT useful by itself for a model. The raw price being 40,000 versus 80,000 tells the model nothing about where it's headed. What the model needs are derived features that capture the dynamics: how fast the price is moving, how volatile it's been, whether it's stretched above or below its average, whether volume is unusual. These are the signals -- the raw numbers are just the raw material.
def engineer_features(df):
    """Build feature set from OHLCV data.

    CRITICAL: every feature uses ONLY past data.
    No future peeking. No centered moving averages.
    """
    d = df.copy()
    daily_ret = d['close'].pct_change()
    # Multi-horizon returns (momentum features)
    for period in [1, 3, 7, 14, 30]:
        d[f'return_{period}d'] = d['close'].pct_change(period)
    # Rolling volatility at different windows
    for window in [7, 14, 30]:
        d[f'volatility_{window}d'] = daily_ret.rolling(window).std()
    # Price relative to moving averages
    for window in [7, 20, 50]:
        sma = d['close'].rolling(window).mean()
        d[f'price_to_sma_{window}'] = d['close'] / sma
    # Volume features
    vol_sma_20 = d['volume'].rolling(20).mean()
    d['volume_ratio'] = d['volume'] / vol_sma_20
    # Daily range (high-low spread as fraction of close)
    d['daily_range'] = (d['high'] - d['low']) / d['close']
    d['range_sma_14'] = d['daily_range'].rolling(14).mean()
    # RSI (Relative Strength Index) -- classic momentum oscillator
    delta = d['close'].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    d['rsi'] = 100 - (100 / (1 + gain / loss))
    return d
df = engineer_features(df)
print(f"Features generated: {len(df.columns)} columns total")
Let me walk through the five categories of features we just built, because understanding WHY each one exists matters more than the code itself.
Returns at multiple horizons (1, 3, 7, 14, 30 days) capture momentum -- where the price has been recently. A big positive 7-day return means the price has been climbing. A negative 30-day return means a longer-term downtrend. The model gets to see momentum at multiple timescales, not just one. This is the kind of multi-scale thinking we first saw in episode #15 when we discussed creating features at different granularities.
Volatility (rolling standard deviation of daily returns) captures how turbulent the market has been. Regime changes -- the exact thing we're trying to predict -- often coincide with volatility spikes. A market transitioning from calm sideways movement to a panicked selloff shows a sharp jump in volatility BEFORE the full extent of the move is visible in the return features.
Price-to-moving-average ratios tell the model whether the price is stretched above or below its recent average. If the price is 10% above its 50-day moving average, that could mean strong momentum (bullish) or overextension (correction coming). The model gets to learn which interpretation fits better.
Volume ratio reveals unusual trading activity relative to the 20-day norm. Big moves on high volume are more likely to be real regime changes than big moves on normal volume.
RSI (Relative Strength Index) is a classic momentum oscillator that runs from 0 to 100. Below 30 is considered "oversold" (bearish exhaustion), above 70 is "overbought" (bullish exhaustion). Whether those textbook thresholds actually work is for the model to figure out ;-)
And here's the part that REALLY matters. Notice that every single feature uses only past data. Rolling windows look backwards. .pct_change() compares today to a past date. .rolling(14).mean() uses the previous 14 values. There are no centered moving averages, no features that peek into the future. This is the "cardinal sin of data leakage" we talked about in episode #14 -- if a feature uses future information, the model looks brilliant in backtesting and then falls on its face in production. Our features are clean.
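To make that leakage concrete, here's a tiny sketch (synthetic series, illustrative only) contrasting a backward-looking rolling mean with a centered one -- note that the centered value at row 1 already incorporates row 2:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 1, 10).cumsum())

# Backward-looking: the value at row t averages rows t-2..t -- safe
safe = s.rolling(3).mean()
# Centered: the value at row t averages rows t-1..t+1 -- row t sees row t+1
leaky = s.rolling(3, center=True).mean()

# Both of these average s[0..2], but the centered version assigns it to
# row 1 -- a feature built this way hands the model tomorrow's price
print(safe.iloc[2], leaky.iloc[1])
```

Any feature built with `center=True` (or any forward `shift`) will look brilliant in backtests and collapse in production, because in production row t+1 doesn't exist yet.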
Creating the target variable
The target is straightforward: compute the 14-day forward return for each row, then bucket it into one of our three classes.
# Forward return: what WILL happen in the next 14 days
df['forward_return'] = df['close'].shift(-14) / df['close'] - 1
# Map to classes: 0=bearish, 1=ranging, 2=bullish
df['regime'] = df['forward_return'].apply(
    lambda r: 2 if r > 0.05 else (0 if r < -0.05 else 1)
)
# Drop rows with NaN (from rolling windows and forward shift)
df = df.dropna()
# Select feature columns -- exclude the raw OHLCV levels and the
# target-related columns; the engineered returns, ratios, and
# oscillators are the actual features
feature_cols = [c for c in df.columns
                if c not in ['close', 'high', 'low', 'volume',
                             'forward_return', 'regime']]
X = df[feature_cols].values
y = df['regime'].values
print(f"Features: {len(feature_cols)}")
print(f"Samples: {len(X)}")
print(f"Class distribution: {dict(zip(*np.unique(y, return_counts=True)))}")
print(f" 0=bearish, 1=ranging, 2=bullish")
Notice what we're NOT including as features. Raw prices are excluded. A close price of 40,000 versus 80,000 carries no signal about the future regime -- what matters is the CHANGE and the RATIO, not the absolute level. Features should be stationary (their statistical properties shouldn't change over time) wherever possible. Raw prices trend upward or downward; returns and ratios fluctuate around relatively stable distributions. This is a subtle but important point that separates working financial ML from broken financial ML.
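A quick sketch of why this matters, using a simulated drifting series (assumed parameters, illustration only): the price level wanders far from its early values while the returns keep roughly the same distribution throughout.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated drifting price: steady upward drift plus noise
returns = rng.normal(0.003, 0.01, 1000)
price = 100 * np.exp(np.cumsum(returns))

half = len(price) // 2
# The price level differs wildly between halves (non-stationary)...
print(f"price mean:   {price[:half].mean():10.2f} vs {price[half:].mean():10.2f}")
# ...while the return distribution barely moves (roughly stationary)
print(f"returns mean: {returns[:half].mean():10.5f} vs {returns[half:].mean():10.5f}")
```

A model trained on price levels from the first half has literally never seen the levels in the second half; a model trained on returns has seen the whole relevant range.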
Walk-forward validation: don't lie to yourself
Here's where many ML projects on time series data go catastrophically wrong, and I can't emphasize this enough. If you use random train_test_split or random k-fold cross-validation on time series data, you are letting the model see the future. Data from December 2023 might end up in the training set while data from July 2023 (earlier!) is in the test set. Since nearby time points are correlated (yesterday's volatility is similar to today's), the model effectively cheats -- it has information about the test period leaking through the training set.
We covered this in episode #14 when we discussed the three-way split. For time series, the correct approach is walk-forward validation: always train on the past, always test on the future.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
print("Walk-forward validation splits:")
print(f"{'Fold':>6s} {'Train start':>12s} {'Train end':>12s} "
      f"{'Test start':>12s} {'Test end':>12s} {'Train':>6s} {'Test':>5s}")
print("-" * 78)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"{i+1:>6d} "
          f"{str(df.index[train_idx[0]].date()):>12s} "
          f"{str(df.index[train_idx[-1]].date()):>12s} "
          f"{str(df.index[test_idx[0]].date()):>12s} "
          f"{str(df.index[test_idx[-1]].date()):>12s} "
          f"{len(train_idx):>6d} {len(test_idx):>5d}")
Each fold trains on all data up to a point and tests on the next chunk. The training window grows with each fold, but the test data is ALWAYS strictly in the future relative to the training data. This mimics how the model would actually be deployed -- you train on everything you have, then predict tomorrow. You never get to peek at tomorrow while training.
But why not just use one static train/test split? Because one split gives you one noisy estimate of performance. The walk-forward splits give you five estimates across different market conditions. Maybe the model works well on the 2022 test period but terribly on the 2023 test period. You want to know that. Cross-validation (episode #13) taught us this lesson for i.i.d. data -- the time-series version applies the same principle.
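You can demonstrate the inflation directly. The sketch below is a synthetic, illustrative setup: two random-walk features that are completely independent of an oscillating label, so there is nothing real to learn. Shuffled k-fold still scores well above chance, because every test row has a near-identical temporal neighbor sitting in the training set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
n = 600
t = np.arange(n)
# Features: two smooth random walks -- nearby rows are nearly identical
X = np.column_stack([np.cumsum(rng.normal(0, 1, n)),
                     np.cumsum(rng.normal(0, 1, n))])
# Label: oscillates over time, completely unrelated to the features
y = (np.sin(2 * np.pi * t / 60) >= 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
shuffled = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=42))
walkfwd = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))
print(f"shuffled k-fold accuracy: {shuffled.mean():.3f}")  # inflated by leakage
print(f"walk-forward accuracy:    {walkfwd.mean():.3f}")   # typically near chance
```

Same data, same model -- the only difference is whether the validator respects time. The shuffled score is pure leakage: the model memorizes each training row and copies its label onto the adjacent test row.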
The model bake-off: every classifier we've built
Here we go. The fun part. We'll compare all five families of classifiers we've covered across episodes #12-20. Same data, same features, same validation, same metric. A fair fight:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score
models = {
    'LogReg': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(max_iter=1000, random_state=42))
    ]),
    'DecTree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'RF': RandomForestClassifier(
        n_estimators=200, max_depth=8, random_state=42),
    'GBM': GradientBoostingClassifier(
        n_estimators=200, max_depth=4,
        learning_rate=0.1, random_state=42),
    'SVM': Pipeline([
        ('scaler', StandardScaler()),
        ('model', SVC(kernel='rbf', C=10, random_state=42))
    ]),
}
results = {}
tscv = TimeSeriesSplit(n_splits=5)
print(f"{'Model':>8s} {'F1 mean':>8s} {'F1 std':>8s} {'Scores per fold'}")
print("-" * 65)
for name, model in models.items():
    fold_scores = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        fold_scores.append(f1_score(y[test_idx], preds, average='macro'))
    mean_f1 = np.mean(fold_scores)
    std_f1 = np.std(fold_scores)
    results[name] = (mean_f1, std_f1)
    scores_str = " ".join(f"{s:.3f}" for s in fold_scores)
    print(f"{name:>8s} {mean_f1:>8.3f} {std_f1:>8.3f} [{scores_str}]")
A few things to notice about this code.
We use macro F1 score instead of accuracy. Remember from episode #13 -- accuracy can be deeply misleading when classes are imbalanced. If "ranging" dominates the dataset (which it often does in real markets -- most days the price doesn't move 5% in either direction), a dumb model that ALWAYS predicts "ranging" would score high on accuracy while being completely useless for identifying the bull and bear regimes we actually care about. Macro F1 averages the F1 across all three classes equally, so the model has to be decent at predicting ALL of them, not just the majority class.
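Here's that failure mode in miniature -- hypothetical labels with an 80% "ranging" majority and a "model" that only ever predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 80% ranging, 10% bearish, 10% bullish
y_true = np.array([1] * 80 + [0] * 10 + [2] * 10)
y_pred = np.ones(100, dtype=int)  # always predict "ranging"

acc = accuracy_score(y_true, y_pred)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(f"accuracy : {acc:.2f}")    # 0.80 -- looks respectable
print(f"macro F1 : {macro:.2f}")  # 0.30 -- exposes the useless model
```

Accuracy rewards the majority-class shortcut; macro F1 punishes the two classes the model never predicts.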
Logistic regression and SVMs get StandardScaler in a Pipeline (episodes #16 and #20). Tree-based methods don't need it -- they split on thresholds and are scale-invariant. Same lesson we've been reinforcing since episode #14. The Pipeline ensures the scaler is fitted on training data only and applied consistently to test data -- no data leakage from the scaler either.
The five models represent the full spectrum of Arc 1. Logistic regression is our linear baseline -- if it does well, the problem might not need complex models. The decision tree shows what a single tree can do (usually not great, as we saw in episode #17 -- high variance, brittle). Random forests and gradient boosting are our heavy hitters for tabular data (episodes #18 and #19). And the SVM tests whether the geometric maximum-margin approach from episode #20 adds anything on this particular problem.
In my experience on financial data, gradient boosting or random forests typically win, but the margin over logistic regression is often surprisingly small. That tells you something about the inherent difficulty of the problem. When even the fanciest model can barely beat a linear baseline, the signal-to-noise ratio is just genuinely low.
Analyzing the winner
from sklearn.metrics import classification_report
# Identify best model
best_name = max(results, key=lambda k: results[k][0])
best_f1, best_std = results[best_name]
print(f"Best model: {best_name} (F1={best_f1:.3f} +/- {best_std:.3f})")
# Retrain on the last fold for detailed analysis
best_model = models[best_name]
train_idx, test_idx = list(tscv.split(X))[-1]
best_model.fit(X[train_idx], y[train_idx])
y_pred = best_model.predict(X[test_idx])
print(f"\nDetailed results on last fold "
      f"({df.index[test_idx[0]].date()} to "
      f"{df.index[test_idx[-1]].date()}):\n")
print(classification_report(
    y[test_idx], y_pred,
    target_names=['Bearish', 'Ranging', 'Bullish']
))
Don't be surprised -- or disappointed -- if the results are modest. A macro F1 of 0.40-0.50 on three-class crypto regime prediction is realistic and honestly pretty respectable. If you see anything above 0.60, you should immediately get suspicious and check for data leakage before celebrating. Seriously -- that's not pessimism, that's experience. Financial markets are noisy. The signal-to-noise ratio is terrible. The fact that your model does better than random (which would give roughly 0.33 macro F1 on three balanced classes) means it's finding SOMETHING real in the features. But don't expect anything close to the 90%+ accuracy we got on the apartment dataset in episodes #18 and #19. That data had a clean, strong, knowable relationship. Markets... don't.
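If you want that chance baseline as an actual number rather than folklore, scikit-learn's DummyClassifier gives it to you. A sketch on stand-in data (synthetic, roughly balanced three classes, features with no signal):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y = rng.integers(0, 3, 900)    # three roughly balanced classes
X = rng.normal(size=(900, 5))  # features carry no information

scores = {}
for strategy in ['most_frequent', 'uniform']:
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X[:600], y[:600])
    scores[strategy] = f1_score(y[600:], dummy.predict(X[600:]),
                                average='macro', zero_division=0)
    print(f"{strategy:>13s}: macro F1 = {scores[strategy]:.3f}")
```

Uniform guessing lands near 0.33 macro F1 on balanced classes, and always-predict-the-majority lands even lower -- so a real model at 0.45 is genuinely ahead of both baselines.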
Feature importance: what did the model actually learn?
If the best model is tree-based (which it probably will be), we can inspect feature importances. This is our sanity check -- if the top features don't make financial sense, something's wrong with the pipeline. We built this exact technique in episodes #15 and #18.
# Get the actual model object (might be inside a Pipeline)
if hasattr(best_model, 'feature_importances_'):
    importances = best_model.feature_importances_
elif hasattr(best_model, 'named_steps'):
    # Best model is a Pipeline -- look inside for the estimator
    inner = best_model.named_steps.get('model', best_model)
    importances = getattr(inner, 'feature_importances_', None)
else:
    importances = None

if importances is not None:
    sorted_idx = np.argsort(importances)[::-1]
    print("Top 10 most important features:")
    for rank, i in enumerate(sorted_idx[:10], 1):
        bar = "#" * int(importances[i] * 50)
        print(f" {rank:>2d}. {feature_cols[i]:>25s}: "
              f"{importances[i]:.3f} {bar}")
else:
    print("(Best model doesn't expose feature importances directly)")
What should we expect to see at the top? Momentum features (multi-day returns, especially the 14-day and 30-day returns) should rank high -- because we're predicting a 14-day forward return, and momentum tends to persist in financial data. Volatility should be up there too, because regime changes are closely tied to shifts in volatility. RSI and volume ratio will probably contribute but be secondary.
If you see something weird like "daily_range" as the single top feature with 60% importance, that would be a red flag for potential overfitting or a data quirk. The model should learn what makes financial sense. If it doesn't, you need to investigate. This is why we do feature importance checks -- it's not just a nice-to-have, it's part of responsible model building.
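One complementary check worth knowing: permutation importance, which measures how much the score drops when a single feature's column is shuffled. It's less biased toward noisy high-cardinality features than impurity-based importances. A minimal sketch on synthetic data (one informative feature, one pure-noise feature):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
# Only feature 0 drives the label; feature 1 is noise
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
# Shuffle each column in turn; the score drop is that feature's importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
for name, imp in zip(['signal', 'noise'], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

The shuffled signal column craters the score; the shuffled noise column barely moves it. Agreement between the two importance methods is extra evidence the model learned something real.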
Can we do better? Let's try tuning the winner
Rather than accepting the default hyperparameters, let's use the GridSearchCV framework from episode #16. But with a twist -- we'll use TimeSeriesSplit as our cross-validator so the grid search respects the time ordering:
from sklearn.model_selection import GridSearchCV
# Tune the gradient boosting model specifically
gb_pipe = Pipeline([
    ('model', GradientBoostingClassifier(random_state=42))
])
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [3, 4, 5],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__subsample': [0.8, 1.0],
}
grid = GridSearchCV(
    gb_pipe, param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring='f1_macro',
    n_jobs=-1
)
grid.fit(X, y)
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV macro F1: {grid.best_score_:.3f}")
This is a compact take on the tuning approach from episode #19 -- instead of tuning in stages, we search learning rate, tree count, depth, and subsampling in a single grid, which is fine when the grid is this small. The TimeSeriesSplit(n_splits=3) inside GridSearchCV ensures that even during the hyperparameter search, we never train on future data. That detail is easy to miss -- if you used the default k-fold here, your tuning would be contaminated even though your final evaluation used walk-forward splits.
Having said that, I should warn you: tuning often doesn't help much on financial data. The problem is inherently noisy. You might squeeze out an extra 0.01-0.02 in F1, which is real but modest. The biggest improvements on this kind of data almost always come from better features, not better hyperparameters. If you have time to invest, invest it in feature engineering (episode #15), not in searching a bigger parameter grid.
Simulating what would happen in practice
Let's do something concrete. The last fold of our walk-forward validation covers a specific date range. Let's look at the model's predictions day by day and see how they compare to what actually happened:
# Retrain best model on last training fold
train_idx, test_idx = list(TimeSeriesSplit(n_splits=5).split(X))[-1]
best_model = models[best_name]
best_model.fit(X[train_idx], y[train_idx])
preds = best_model.predict(X[test_idx])
test_dates = df.index[test_idx]
test_actual = y[test_idx]
regime_names = {0: 'BEAR', 1: 'RANGE', 2: 'BULL'}
# Show a sample of predictions vs actuals
print(f"{'Date':>12s} {'Predicted':>10s} {'Actual':>10s} {'Correct':>8s}")
print("-" * 46)
# Show first 20 predictions
for i in range(min(20, len(preds))):
    pred_label = regime_names[preds[i]]
    actual_label = regime_names[test_actual[i]]
    correct = "YES" if preds[i] == test_actual[i] else "no"
    print(f"{str(test_dates[i].date()):>12s} {pred_label:>10s} "
          f"{actual_label:>10s} {correct:>8s}")
# Per-regime accuracy breakdown
for regime_id, regime_label in regime_names.items():
    mask = test_actual == regime_id
    if mask.sum() > 0:
        acc = (preds[mask] == test_actual[mask]).mean()
        print(f"\n{regime_label}: {acc:.1%} correct "
              f"({mask.sum()} samples)")
Looking at individual predictions grounds the abstract metrics in reality. You'll see the model getting some days right and others wrong. You'll see streaks of correct predictions during strong trend periods and confusion during regime transitions. That's normal. Financial data is noisy. A model with 45% macro F1 is finding real patterns -- it's just operating in an environment where perfection is mathematically impossible because the future is genuinely uncertain.
The honest debrief: what building a complete pipeline teaches you
This project pulls together everything from episodes #1 through #20, and it reveals things that isolated textbook examples simply can't:
Data is the hard part. Count the lines of code we spent on feature engineering versus actual model training. It's not even close. Feature engineering, data cleaning, and validation setup dominated this project. That's normal. That's what episode #14 warned you about -- 80% of real ML work is data work. The model itself is a few lines of scikit-learn.
Model selection matters less than you think. The performance difference between the best and worst model (excluding maybe the single decision tree) is probably pretty small on this data. Maybe 5 percentage points of F1. The difference between good features and bad features is much bigger. If I had to choose between a logistic regression with brilliant features or a gradient boosting model with mediocre features, I'd take the logistic regression every time.
Time-series validation is non-negotiable. If you had used random k-fold cross-validation on this same data, every single model would have scored 10-20% higher. And those scores would be lies. The model would have seen correlated data from the future during training and appeared far better than it actually is. In a real trading context, you'd deploy it with confidence, it would lose money, and you'd have no idea why. Walk-forward validation prevents that specific kind of self-deception.
Modest results are honest results. A model that honestly predicts market regimes with 45% macro F1 is infinitely more valuable than one that claims 85% through leaky validation. The honest model can be improved -- add better features, try different target definitions, extend the training window. The dishonest one will fail silently in production. Always prefer honesty over flattering numbers.
What we used from every episode
I want to be explicit about the connections, because this project is a synthesis:
- Episode #3 (data representation): prices and volumes as numeric arrays in DataFrames
- Episode #13 (evaluation): F1 score, classification_report, understanding when accuracy lies
- Episode #14 (data preparation): handling NaN from rolling windows, avoiding data leakage, train/test philosophy
- Episode #15 (feature engineering): building derived features from raw data, checking feature importances
- Episode #16 (scikit-learn): Pipeline, GridSearchCV, consistent API across all five model families
- Episode #17 (decision trees): single tree as a baseline, understanding overfitting
- Episode #18 (random forests): robust ensemble, feature importances, no scaling needed
- Episode #19 (gradient boosting): sequential correction, learning rate tuning, staged hyperparameter search
- Episode #20 (SVMs): kernel trick, mandatory scaling, Pipeline integration
That's the foundation. Twenty episodes of building blocks, and now you've seen them assembled into something complete. Not perfect -- the results are modest, as they should be on a genuinely hard problem. But complete. And that completeness is what matters going forward.
What comes next
This wraps up what I'd call the "supervised learning" arc of this series. Every model we've built so far has followed the same pattern: give the algorithm labeled data (features + correct answers), and it learns to predict labels for new data. Bullish/bearish/ranging. Apartment prices. Cancer versus benign.
But there's a whole other world of machine learning where you DON'T have labels. You have data -- maybe lots of it -- but no one has marked the "correct answer." And you want to discover structure: which data points naturally group together? Are there clusters? Are there hidden patterns? Can we reduce 50 features down to 2 that capture most of the information? That's unsupervised learning, and it thinks about data in a fundamentally different way.
The tools we've built still apply -- scikit-learn pipelines, evaluation thinking, feature engineering. But the questions change. Instead of "how accurately does the model predict the right label?", the question becomes "does the structure the model found actually mean something useful?" That's a harder question, and a more interesting one.
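As a tiny preview of that mindset, here's a sketch (synthetic data, illustrative only) that compresses six features down to two with PCA and then asks KMeans to find groups -- no labels anywhere:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Two latent "regimes": a calm cluster and a wilder, shifted one
calm = rng.normal(0.0, 0.5, size=(200, 6))
wild = rng.normal(3.0, 2.0, size=(200, 6))
X = np.vstack([calm, wild])

# Standardize, reduce 6 features to 2, then cluster without any labels
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)
print(f"cluster sizes: {np.bincount(labels)}")
```

Whether those two clusters correspond to anything meaningful is exactly the harder question the next arc is about.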
Let's recap
We built a complete ML pipeline from scratch, touching every major concept from Arc 1:
- Market regime prediction is a three-class classification problem (bullish, bearish, ranging) defined by thresholding 14-day forward returns;
- Feature engineering for financial data uses returns at multiple horizons, rolling volatility, price-to-moving-average ratios, volume patterns, and momentum indicators. Every feature MUST use only past data -- forward-looking features are data leakage, the cardinal sin of time-series ML (episode #14);
- Walk-forward validation (TimeSeriesSplit) is mandatory for time series. Random splits produce inflated scores because correlated time points leak information between train and test. This is the time-series version of the evaluation principles from episode #13;
- We compared all five algorithm families from episodes #12-20: logistic regression, decision trees, random forests, gradient boosting, and SVMs. The tree-based ensembles typically win on tabular financial data, but the margin over simpler models is often small -- which tells you the signal-to-noise ratio is the real bottleneck, not the model;
- Feature importance serves as a sanity check. If the top features don't make domain sense, something is wrong with your pipeline. We used the same techniques from episodes #15 and #18;
- Honest, modest results on genuinely hard problems are MORE valuable than inflated numbers from leaky validation. A model with 45% macro F1 that you can trust is better than a model with 85% that was evaluated wrong.