Learn Python Series (#55) - Advanced Data Wrangling with Pandas
Repository
What will I learn
- You will learn how window functions let you compute rolling, expanding, and grouped calculations;
- how MultiIndex gives you hierarchical indexing for complex datasets;
- time series resampling and shifting for temporal analysis;
- the apply() escape hatch and why you should avoid it when possible;
- advanced merge patterns including merge_asof for inexact time-based joins;
- real-world data wrangling patterns that go beyond basic filtering and groupby.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- Familiarity with Pandas basics (episodes #30-33) and modern Pandas (#54);
- The ambition to learn Python programming.
Difficulty
- Intermediate to advanced
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
- Learn Python Series (#30) - Data Science Part 1 - Pandas
- Learn Python Series (#31) - Data Science Part 2 - Pandas
- Learn Python Series (#32) - Data Science Part 3 - Pandas
- Learn Python Series (#33) - Data Science Part 4 - Pandas
- Learn Python Series (#34) - Working with APIs in 2026: What's Changed
- Learn Python Series (#35) - Working with APIs Part 2: Beyond GET Requests
- Learn Python Series (#36) - Type Hints and Modern Python
- Learn Python Series (#37) - Virtual Environments and Dependency Management
- Learn Python Series (#38) - Testing Your Code Part 1
- Learn Python Series (#39) - Testing Your Code Part 2
- Learn Python Series (#40) - Asynchronous Python Part 1
- Learn Python Series (#41) - Asynchronous Python Part 2
- Learn Python Series (#42) - Building CLI Applications
- Learn Python Series (#43) - Mini Project - Crypto Price Tracker
- Learn Python Series (#44) - Context Managers & Decorators Deep Dive
- Learn Python Series (#45) - Metaclasses & Class Design Patterns
- Learn Python Series (#46) - Descriptors & Properties
- Learn Python Series (#47) - Generators & Iterators Advanced
- Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing
- Learn Python Series (#49) - FastAPI Basics - Modern Web APIs
- Learn Python Series (#50) - FastAPI Advanced - Validation & Dependencies
- Learn Python Series (#51) - Database Integration - SQLAlchemy
- Learn Python Series (#52) - Authentication & Security
- Learn Python Series (#53) - Deployment & Production Best Practices
- Learn Python Series (#54) - Pandas in 2026: What's Changed
- Learn Python Series (#55) - Advanced Data Wrangling with Pandas (this post)
GitHub Account
Learn Python Series (#55) - Advanced Data Wrangling with Pandas
In episodes #30-33 we learned to load, filter, group, and merge DataFrames. Last episode we caught up with modern Pandas features - copy-on-write, Arrow backend, nullable dtypes, method chaining. Now it's time for the techniques that separate "I know Pandas" from "I can actually wrangle real-world data."
The basics get you 60% of the way. Groupby, merge, filter - those handle the clean, well-structured cases. But real data isn't clean. It has irregular timestamps, hierarchical dimensions, temporal patterns that need sliding windows, and datasets at different frequencies that need to be aligned. That's what this episode is about.
Nota bene: Data wrangling is where you spend 80% of your time in any data project. The actual analysis or model is often trivial once the data is shaped correctly. Mastering these techniques is what makes that 80% productive instead of painful.
Window Functions - Rolling, Expanding, and Grouped
Window functions compute values across a sliding window of rows - moving averages, running totals, rolling standard deviations. We touched on .rolling() briefly in episode #54, but windows go much deeper than a simple moving average.
Rolling Windows
A rolling window looks backward from each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range('2025-01-01', periods=100, freq='D'),
'price': np.random.randn(100).cumsum() + 100,
'volume': np.random.randint(1000, 5000, 100),
})
df = df.set_index('date')
# 7-day rolling mean
df['ma_7'] = df['price'].rolling(7).mean()
# Rolling with minimum periods - don't wait for full window
df['ma_7_partial'] = df['price'].rolling(7, min_periods=1).mean()
# Rolling standard deviation (volatility proxy)
df['volatility_30'] = df['price'].rolling(30).std()
# Exponentially weighted - recent values matter more
df['ema_7'] = df['price'].ewm(span=7).mean()
The min_periods parameter is important in practice. Without it, the first 6 rows in a 7-day rolling window return NaN - you lose data at the start of your series. With min_periods=1, you get a partial average from whatever data is available. Whether that's appropriate depends on your use case: for financial analysis, you usually want the full window; for dashboards and reports, partial windows are often fine.
Exponentially weighted moving averages (ewm) are worth knowing because they respond faster to recent changes than simple moving averages. The span parameter controls how quickly old data fades - span=7 corresponds to a smoothing factor alpha = 2 / (span + 1) = 0.25, so each observation's weight decays by a factor of 0.75 per period, and data 7 periods back carries roughly 13% of the current observation's weight.
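To see the "responds faster" claim in action, here's a small sketch comparing a 7-day simple moving average against a span=7 EWM on a synthetic step change (toy data, not the episode's dataset):

```python
import pandas as pd

# A step change: 20 days flat at 100, then a jump to 110.
s = pd.Series([100.0] * 20 + [110.0] * 20)

sma = s.rolling(7).mean()
ema = s.ewm(span=7).mean()

# On the first day after the jump the EMA has already moved further
# toward 110 than the 7-day SMA, because the newest value gets
# weight alpha = 0.25 versus 1/7 in the SMA.
print(round(sma.iloc[20], 2), round(ema.iloc[20], 2))
```

The gap narrows as the window fills, but the EMA's head start is exactly why it's preferred for signals that must react quickly.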
Expanding Windows
Expanding windows grow from the start of the series - every row includes all data from the beginning up to that point:
# Cumulative mean (running average)
df['cumulative_avg'] = df['price'].expanding().mean()
# All-time high (running maximum)
df['all_time_high'] = df['price'].expanding().max()
# Cumulative return since start
df['cum_return'] = (df['price'] / df['price'].iloc[0] - 1) * 100
Expanding windows are the right choice for cumulative metrics - year-to-date averages, running records, or any calculation where the "window" is "everything up to now."
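A typical use case that combines both ideas is drawdown: the percentage a series sits below its running maximum. A minimal sketch on toy prices:

```python
import pandas as pd

prices = pd.Series([100.0, 105.0, 103.0, 110.0, 102.0])

# Percentage below the all-time high up to each point.
running_max = prices.expanding().max()
drawdown = (prices / running_max - 1) * 100

print(drawdown.round(2).tolist())
# Last point: 102 against a peak of 110 is about -7.27%
```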
Grouped Rolling - The Real Power
This is where things get interesting. Apply rolling calculations within groups:
trades = pd.DataFrame({
'date': pd.date_range('2025-01-01', periods=200, freq='D').tolist() * 2,
'symbol': ['BTC'] * 200 + ['ETH'] * 200,
'price': np.random.randn(400).cumsum() + 100,
})
trades = trades.sort_values(['symbol', 'date'])
# 7-day rolling mean PER symbol - independently for BTC and ETH
trades['ma_7'] = (
trades
.groupby('symbol')['price']
.transform(lambda x: x.rolling(7, min_periods=1).mean())
)
The .transform() method is crucial here - it returns a Series with the same index as the input, so you can assign it back as a column. .apply() on a GroupBy can return anything (a scalar, a Series of different length, a DataFrame), which makes alignment unpredictable. .transform() guarantees index alignment.
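The alignment difference is easiest to see on toy data: transform hands back one value per input row, while apply with a reducing function collapses each group:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# transform: same length and index as the input - safe to assign back
group_mean = df.groupby('group')['value'].transform('mean')
print(group_mean.tolist())  # [1.5, 1.5, 3.5, 3.5]

# apply with a reducer: one row per group - a different shape entirely
per_group = df.groupby('group')['value'].apply(lambda s: s.mean())
print(per_group.tolist())   # [1.5, 3.5]
```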
Custom Window Functions
When built-in aggregations aren't enough, .rolling().apply() lets you run arbitrary Python:
# Percentage of days in the window where price increased
def pct_up_days(window):
if len(window) < 2:
return np.nan
changes = np.diff(window)
return (changes > 0).sum() / len(changes) * 100
df['pct_up_7d'] = df['price'].rolling(7).apply(pct_up_days, raw=True)
The raw=True parameter passes the window as a NumPy array instead of a Pandas Series - this is significantly faster because it skips Series construction overhead. Always use raw=True unless your function needs index information.
MultiIndex - Hierarchical Indexing
When your data has natural hierarchies - exchange/pair, country/city, year/quarter/month - MultiIndex lets you work with them explicitly instead of flattening everything into separate columns:
trades = pd.DataFrame({
'exchange': ['Binance', 'Binance', 'Kraken', 'Kraken', 'Binance', 'Kraken'],
'pair': ['BTC/USDT', 'ETH/USDT', 'BTC/USDT', 'ETH/USDT', 'BTC/USDT', 'BTC/USDT'],
'volume': [100, 50, 80, 40, 110, 90],
'price': [50000, 3000, 50100, 3010, 50200, 50150],
})
multi = trades.set_index(['exchange', 'pair'])
print(multi)
volume price
exchange pair
Binance BTC/USDT 100 50000
ETH/USDT 50 3000
Kraken BTC/USDT 80 50100
ETH/USDT 40 3010
Binance BTC/USDT 110 50200
Kraken BTC/USDT 90 50150
Selecting Data from a MultiIndex
# All Binance rows
multi.loc['Binance']
# Specific exchange + pair
multi.loc[('Binance', 'BTC/USDT')]
# Cross-section: all BTC/USDT regardless of exchange
multi.xs('BTC/USDT', level='pair')
# Slice a range on the first level (slicing requires a lexsorted index)
multi.sort_index().loc['Binance':'Kraken']
The .xs() (cross-section) method is the clean way to select on a non-first level. Without it, you'd need to use multi.loc[pd.IndexSlice[:, 'BTC/USDT'], :] - which works but is harder to read.
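For completeness, here are both spellings side by side on a small sorted example (note the sort_index(): slicing a MultiIndex with IndexSlice requires a lexsorted index):

```python
import pandas as pd

df = pd.DataFrame({
    'exchange': ['Binance', 'Binance', 'Kraken', 'Kraken'],
    'pair': ['BTC/USDT', 'ETH/USDT', 'BTC/USDT', 'ETH/USDT'],
    'volume': [100, 50, 80, 40],
}).set_index(['exchange', 'pair']).sort_index()

idx = pd.IndexSlice
via_slice = df.loc[idx[:, 'BTC/USDT'], :]  # keeps the 'pair' level
via_xs = df.xs('BTC/USDT', level='pair')   # drops the 'pair' level

print(via_slice['volume'].tolist())  # [100, 80]
print(via_xs['volume'].tolist())     # [100, 80]
```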
Pivot Tables and Reshaping
Pivot tables create MultiIndex naturally:
pivot = trades.pivot_table(
values=['volume', 'price'],
index='exchange',
columns='pair',
aggfunc={'volume': 'sum', 'price': 'mean'}
)
# Access: (outer column, inner column)
pivot[('volume', 'BTC/USDT')]
Stacking and unstacking reshape between long and wide formats - operations that come up constantly when preparing data for analysis or visualization:
# Wide to long (move column levels to index)
long = pivot.stack()
# Long to wide (move index levels to columns)
wide = long.unstack()
# Flatten a MultiIndex column header
pivot.columns = ['_'.join(col).strip() for col in pivot.columns.values]
# Now: 'volume_BTC/USDT', 'volume_ETH/USDT', 'price_BTC/USDT', 'price_ETH/USDT'
The column-flattening trick is essential when you need to export data or use it with tools that don't understand MultiIndex (plotting libraries, CSV exports, database inserts).
Time Series - Resampling and Shifting
Pandas was built for time series. Once your index is a DatetimeIndex, you unlock temporal operations that would be painful to implement manually.
Resampling - Changing Frequency
# Daily to monthly (downsampling - aggregate)
monthly = df.resample('ME').agg(
open=('price', 'first'),
high=('price', 'max'),
low=('price', 'min'),
close=('price', 'last'),
total_volume=('volume', 'sum'),
avg_volume=('volume', 'mean'),
)
# Daily to weekly
weekly = df.resample('W-MON').mean()  # Weekly bins ending (and labeled) on Monday
# Monthly to daily (upsampling - interpolate)
daily_from_monthly = monthly['close'].resample('D').interpolate(method='linear')
The resampling frequency strings are worth memorizing: 'D' (day), 'W' (week), 'ME' (month-end), 'MS' (month-start), 'QE' (quarter-end), 'YE' (year-end), 'h' (hour), 'min' (minute). The ME/MS distinction matters - 'ME' labels each group with the last day of the month, 'MS' with the first.
Shifting - Temporal Comparisons
# Previous day's price
df['prev_close'] = df['price'].shift(1)
# Daily return (percentage change)
df['daily_return'] = df['price'].pct_change()
# 7-day return
df['weekly_return'] = df['price'].pct_change(7)
# Difference from 30 days ago (absolute, not percentage)
df['price_change_30d'] = df['price'].diff(30)
# Next day's price (negative shift = look forward)
df['next_close'] = df['price'].shift(-1)
Forward-looking shifts (shift(-1)) are used for creating target variables in predictive modeling - "given today's features, predict tomorrow's price." Be careful not to accidentally use future data in your features (this is called data leakage and it makes your models look great in backtesting and terrible in production).
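A minimal sketch of the target-versus-feature distinction (toy prices; the column names are my own):

```python
import pandas as pd

df = pd.DataFrame({'close': [100.0, 102.0, 101.0, 105.0]})

# Target: tomorrow's return - built with a forward shift.
df['target'] = df['close'].pct_change().shift(-1)

# Feature: yesterday's return - only uses data known at prediction time.
df['prev_return'] = df['close'].pct_change().shift(1)

# The final row has no target (tomorrow is unknown); drop it before
# training rather than filling it with information from the future.
train = df.dropna(subset=['target'])
print(len(train))  # 3
```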
Time Zone Handling
# Localize naive timestamps to UTC
df.index = df.index.tz_localize('UTC')
# Convert to another timezone
df.index = df.index.tz_convert('Europe/Amsterdam')
# Create timezone-aware range directly
idx = pd.date_range('2025-01-01', periods=100, freq='h', tz='US/Eastern')
Always store timestamps in UTC internally and convert to local time only for display. This avoids daylight saving time bugs (clocks jumping forward/backward create duplicate or missing hours), ambiguous timestamps during DST transitions, and confusion when combining data from different geographic sources.
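Here's a sketch of the failure mode, using tz_localize's nonexistent parameter to handle a spring-forward gap explicitly (the EU switches on the last Sunday of March):

```python
import pandas as pd

# 02:30 on 2025-03-30 does not exist in Amsterdam: clocks jump
# from 02:00 straight to 03:00 that night.
naive = pd.DatetimeIndex(['2025-03-30 02:30'])

# Plain tz_localize('Europe/Amsterdam') would raise here; explicit
# handling shifts into the first valid moment instead.
fixed = naive.tz_localize('Europe/Amsterdam', nonexistent='shift_forward')
print(fixed[0])  # 2025-03-30 03:00:00+02:00
```

If your pipeline stores UTC from the start, you never have to make this call at all.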
Filling Gaps in Time Series
Real-world time series often have missing dates (weekends for stock data, downtime for sensors). Reindex to fill the gaps:
# Create complete date range and reindex
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(full_range)
# Fill gaps with various strategies
df['price_ffill'] = df['price'].ffill() # Forward-fill: use last known value
df['price_interp'] = df['price'].interpolate() # Linear interpolation
df['volume_zero'] = df['volume'].fillna(0) # Fill with zero (no trades = zero volume)
The choice of fill strategy depends on what the missing data means. Forward-fill is appropriate for prices (the last known price is the best estimate). Interpolation works for smooth physical measurements (temperature, pressure). Zero-fill works for counts and volumes (no data = nothing happened).
The apply() Problem - And What to Use Instead
.apply() is Pandas' escape hatch - it runs a Python function on each row, column, or group. It's tempting to use it for everything, and that temptation should be resisted:
# SLOW: apply runs a Python loop internally (~100x slower)
df['category'] = df['price'].apply(lambda x: 'high' if x > 50000 else 'low')
# FAST: vectorized with np.where
df['category'] = np.where(df['price'] > 50000, 'high', 'low')
# SLOW: apply for string operations
df['upper'] = df['name'].apply(lambda x: x.upper())
# FAST: vectorized string accessor
df['upper'] = df['name'].str.upper()
# SLOW: apply for multiple conditions
def classify(row):
if row['price'] > 50000 and row['volume'] > 3000:
return 'hot'
elif row['price'] > 50000:
return 'expensive'
else:
return 'normal'
df['class'] = df.apply(classify, axis=1)
# FAST: np.select for multiple conditions
conditions = [
(df['price'] > 50000) & (df['volume'] > 3000),
df['price'] > 50000,
]
choices = ['hot', 'expensive']
df['class'] = np.select(conditions, choices, default='normal')
np.select is the vectorized replacement for multi-branch if/elif/else logic. It evaluates conditions in order and returns the first match, just like if/elif. For two conditions, np.where is enough; for three or more, np.select is the tool.
When apply() IS appropriate:
- Complex logic that genuinely can't be vectorized (e.g., calling an external API per row)
- GroupBy operations that need access to the full group as a DataFrame
- Prototyping - get it working first, then optimize
When to avoid it:
- Element-wise numeric or string operations (use vectorized ops)
- Conditional assignment (use np.where or np.select)
- Any operation where performance matters on >10K rows
The performance difference is real and grows with data size: on a million-row DataFrame, a vectorized operation might take 10ms while apply() takes 3 seconds. That's 300x slower for doing the exact same thing.
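Exact numbers depend on your machine, but the gap is easy to reproduce yourself - a rough benchmark sketch:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.randint(1, 100_000, 100_000)})

t_apply = timeit.timeit(
    lambda: df['price'].apply(lambda x: 'high' if x > 50_000 else 'low'),
    number=3,
)
t_vec = timeit.timeit(
    lambda: np.where(df['price'] > 50_000, 'high', 'low'),
    number=3,
)
print(f'apply: {t_apply:.3f}s  vectorized: {t_vec:.3f}s  ({t_apply / t_vec:.0f}x)')
```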
Advanced Merge Patterns
Episodes #30-33 covered pd.merge() for SQL-style joins. Here are the advanced patterns for real-world scenarios.
Merge with Indicator
Know which rows matched and which didn't:
result = pd.merge(users, orders, on='user_id', how='outer', indicator=True)
# _merge column values: 'left_only', 'right_only', 'both'
# Find users who never placed an order
inactive = result.query('_merge == "left_only"')
# Find orphaned orders (no matching user)
orphaned = result.query('_merge == "right_only"')
This is invaluable for data quality checks - quickly finding records that should have matched but didn't.
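Made concrete with toy users/orders tables (the names and values are mine):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['ann', 'bob', 'cem']})
orders = pd.DataFrame({'user_id': [1, 1, 4], 'amount': [10, 20, 30]})

result = pd.merge(users, orders, on='user_id', how='outer', indicator=True)

inactive = result.query('_merge == "left_only"')   # users who never ordered
orphaned = result.query('_merge == "right_only"')  # orders with no known user

print(inactive['name'].tolist())    # ['bob', 'cem']
print(orphaned['amount'].tolist())  # [30.0] - the NaNs forced the int column to float
```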
merge_asof - Inexact Time-Based Joins
This is the killer feature for time series work. When two datasets have timestamps that don't align exactly (which is always the case in practice):
# Trades happen at specific moments
trades = pd.DataFrame({
'time': pd.to_datetime(['2025-01-01 10:01:03', '2025-01-01 10:02:15', '2025-01-01 10:03:45']),
'symbol': ['BTC', 'ETH', 'BTC'],
'size': [1.0, 5.0, 2.0],
})
# Quotes update at different moments
quotes = pd.DataFrame({
'time': pd.to_datetime(['2025-01-01 10:01:00', '2025-01-01 10:02:00', '2025-01-01 10:03:00', '2025-01-01 10:03:30']),
'symbol': ['BTC', 'ETH', 'BTC', 'BTC'],
'bid': [49990, 2990, 50090, 50120],
'ask': [50010, 3010, 50110, 50140],
})
# Each trade gets the MOST RECENT quote for its symbol
merged = pd.merge_asof(
trades.sort_values('time'),
quotes.sort_values('time'),
on='time',
by='symbol', # Match within same symbol
direction='backward' # Look backward in time (default)
)
merge_asof finds the nearest match in time instead of requiring exact equality. The direction parameter controls whether to look backward (most recent), forward (next upcoming), or nearest. This is essential when combining trade data with quote data, sensor readings at different frequencies, or any two time series that weren't sampled simultaneously.
You can also set a tolerance to avoid matching records too far apart:
# Only match if quote is within 5 seconds of trade
merged = pd.merge_asof(
trades.sort_values('time'),
quotes.sort_values('time'),
on='time',
by='symbol',
tolerance=pd.Timedelta('5s') # No match beyond 5 seconds
)
Practical Example: Building a Complete Analysis Pipeline
Let's tie everything together in a pipeline that uses multiple techniques:
import pandas as pd
import numpy as np
def load_ohlcv(filepath):
return (
pd.read_parquet(filepath, dtype_backend='pyarrow')
.set_index('date')
.sort_index()
)
def add_technical_features(df):
return df.assign(
ma_7=lambda d: d['close'].rolling(7, min_periods=1).mean(),
ma_30=lambda d: d['close'].rolling(30, min_periods=1).mean(),
volatility=lambda d: d['close'].rolling(30).std(),
daily_return=lambda d: d['close'].pct_change(),
volume_ma=lambda d: d['volume'].rolling(7, min_periods=1).mean(),
volume_ratio=lambda d: d['volume'] / d['volume'].rolling(7).mean(),
trend=lambda d: np.where(d['close'] > d['close'].rolling(30).mean(), 'up', 'down'),
)
def monthly_report(df):
return (
df.resample('ME').agg(
open=('open', 'first'),
high=('high', 'max'),
low=('low', 'min'),
close=('close', 'last'),
total_volume=('volume', 'sum'),
avg_volatility=('volatility', 'mean'),
pct_up_days=('daily_return', lambda x: (x > 0).mean() * 100),
)
.assign(monthly_return=lambda d: d['close'].pct_change() * 100)
.round(2)
)
# The pipeline: load → enrich → summarize → filter
report = (
load_ohlcv('btc_daily.parquet')
.pipe(add_technical_features)
.pipe(monthly_report)
.query('monthly_return > 0') # Only positive months
)
Each function is independently testable, reusable across different assets, and the pipeline reads like a description of what it does. This is the payoff of combining method chaining (episode #54) with the window functions, resampling, and advanced patterns from this episode.
What to remember from this one
In this episode, we went deep on advanced Pandas techniques:
- Rolling windows with min_periods and raw=True for performance - and exponential moving averages for recency-weighted calculations
- Grouped rolling via .transform() applies window functions independently per group while maintaining index alignment
- MultiIndex for hierarchical data with .xs() for cross-sectional selection and column flattening for export
- Pivot tables, stacking, and unstacking reshape data between long and wide formats
- Resampling changes frequency (daily to monthly, hourly to daily) with proper aggregation
- Shifting enables temporal comparisons and feature engineering - but watch out for data leakage with forward shifts
- Time zones: store as UTC, convert for display only
- np.where and np.select are the vectorized replacements for .apply() - 100-300x faster on large datasets
- merge_asof with tolerance aligns time series at different frequencies - the killer feature for real-world temporal joins
- Merge indicators track which records matched, essential for data quality checks
These techniques bridge the gap between "loading a CSV" and "doing actual data science." Most of your data work will use some combination of window functions, resampling, and advanced merges. Master these, and the analysis itself becomes the easy part.