Learn Python Series (#55) - Advanced Data Wrangling with Pandas
Repository
What will I learn
- You will learn how window functions let you compute rolling, expanding, and grouped calculations;
- how MultiIndex gives you hierarchical indexing for complex datasets;
- time series resampling and shifting for temporal analysis;
- the apply() escape hatch and why you should avoid it when possible;
- advanced merge patterns including merge_asof for inexact time-based joins;
- real-world data wrangling patterns that go beyond basic filtering and groupby.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- Familiarity with Pandas basics (episodes #30-33) and modern Pandas (#54);
- The ambition to learn Python programming.
Difficulty
- Intermediate to advanced
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
- Learn Python Series (#30) - Data Science Part 1 - Pandas
- Learn Python Series (#31) - Data Science Part 2 - Pandas
- Learn Python Series (#32) - Data Science Part 3 - Pandas
- Learn Python Series (#33) - Data Science Part 4 - Pandas
- Learn Python Series (#34) - Working with APIs in 2026: What's Changed
- Learn Python Series (#35) - Working with APIs Part 2: Beyond GET Requests
- Learn Python Series (#36) - Type Hints and Modern Python
- Learn Python Series (#37) - Virtual Environments and Dependency Management
- Learn Python Series (#38) - Testing Your Code Part 1
- Learn Python Series (#39) - Testing Your Code Part 2
- Learn Python Series (#40) - Asynchronous Python Part 1
- Learn Python Series (#41) - Asynchronous Python Part 2
- Learn Python Series (#42) - Building CLI Applications
- Learn Python Series (#43) - Mini Project - Crypto Price Tracker
- Learn Python Series (#44) - Context Managers & Decorators Deep Dive
- Learn Python Series (#45) - Metaclasses & Class Design Patterns
- Learn Python Series (#46) - Descriptors & Properties
- Learn Python Series (#47) - Generators & Iterators Advanced
- Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing
- Learn Python Series (#49) - FastAPI Basics - Modern Web APIs
- Learn Python Series (#50) - FastAPI Advanced - Validation & Dependencies
- Learn Python Series (#51) - Database Integration - SQLAlchemy
- Learn Python Series (#52) - Authentication & Security
- Learn Python Series (#53) - Deployment & Production Best Practices
- Learn Python Series (#54) - Pandas in 2026: What's Changed
- Learn Python Series (#55) - Advanced Data Wrangling with Pandas (this post)
GitHub Account
Learn Python Series (#55) - Advanced Data Wrangling with Pandas
In episodes #30-33 we learned to load, filter, group, and merge DataFrames. Last episode we caught up with modern Pandas features - copy-on-write, Arrow backend, nullable dtypes, method chaining. Now it's time for the techniques that separate "I know Pandas" from "I can actually wrangle real-world data."
The basics get you 60% of the way. Groupby, merge, filter - those handle the clean, well-structured cases. But real data isn't clean. It has irregular timestamps, hierarchical dimensions, temporal patterns that need sliding windows, and datasets at different frequencies that need to be aligned. That's what this episode is about.
Nota bene: Data wrangling is where you spend 80% of your time in any data project. The actual analysis or model is often trivial once the data is shaped correctly. Mastering these techniques is what makes that 80% productive instead of painful.
Window Functions - Rolling, Expanding, and Grouped
Window functions compute values across a sliding window of rows - moving averages, running totals, rolling standard deviations. We touched on .rolling() briefly in episode #54, but windows go much deeper than a simple moving average.
Rolling Windows
A rolling window looks backward from each row:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range('2025-01-01', periods=100, freq='D'),
'price': np.random.randn(100).cumsum() + 100,
'volume': np.random.randint(1000, 5000, 100),
})
df = df.set_index('date')
# 7-day rolling mean
df['ma_7'] = df['price'].rolling(7).mean()
# Rolling with minimum periods - don't wait for full window
df['ma_7_partial'] = df['price'].rolling(7, min_periods=1).mean()
# Rolling standard deviation (volatility proxy)
df['volatility_30'] = df['price'].rolling(30).std()
# Exponentially weighted - recent values matter more
df['ema_7'] = df['price'].ewm(span=7).mean()
The min_periods parameter is important in practice. Without it, the first 6 rows in a 7-day rolling window return NaN - you lose data at the start of your series. With min_periods=1, you get a partial average from whatever data is available. Whether that's appropriate depends on your use case: for financial analysis, you usually want the full window; for dashboards and reports, partial windows are often fine.
Exponentially weighted moving averages (ewm) are worth knowing because they respond faster to recent changes than simple moving averages. The span parameter controls how quickly old data fades - span=7 corresponds to a smoothing factor alpha = 2 / (span + 1) = 0.25, so each observation's weight decays by a factor of 0.75 per period, and data 7 periods back carries roughly 13% of the current observation's weight.
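To see the "responds faster" claim in action, here's a small sketch comparing a 7-day simple moving average against a span=7 EWM on a synthetic step change (toy data, not the episode's dataset):

```python
import pandas as pd

# A step change: 20 days flat at 100, then a jump to 110.
s = pd.Series([100.0] * 20 + [110.0] * 20)

sma = s.rolling(7).mean()
ema = s.ewm(span=7).mean()

# On the first day after the jump the EMA has already moved further
# toward 110 than the 7-day SMA, because the newest value gets
# weight alpha = 0.25 versus 1/7 in the SMA.
print(round(sma.iloc[20], 2), round(ema.iloc[20], 2))
```

The gap narrows as the window fills, but the EMA's head start is exactly why it's preferred for signals that must react quickly.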
Expanding Windows
Expanding windows grow from the start of the series - every row includes all data from the beginning up to that point:
# Cumulative mean (running average)
df['cumulative_avg'] = df['price'].expanding().mean()
# All-time high (running maximum)
df['all_time_high'] = df['price'].expanding().max()
# Cumulative return since start
df['cum_return'] = (df['price'] / df['price'].iloc[0] - 1) * 100
Expanding windows are the right choice for cumulative metrics - year-to-date averages, running records, or any calculation where the "window" is "everything up to now."
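A typical use case that combines both ideas is drawdown: the percentage a series sits below its running maximum. A minimal sketch on toy prices:

```python
import pandas as pd

prices = pd.Series([100.0, 105.0, 103.0, 110.0, 102.0])

# Percentage below the all-time high up to each point.
running_max = prices.expanding().max()
drawdown = (prices / running_max - 1) * 100

print(drawdown.round(2).tolist())
# Last point: 102 against a peak of 110 is about -7.27%
```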
Grouped Rolling - The Real Power
This is where things get interesting. Apply rolling calculations within groups:
trades = pd.DataFrame({
'date': pd.date_range('2025-01-01', periods=200, freq='D').tolist() * 2,
'symbol': ['BTC'] * 200 + ['ETH'] * 200,
'price': np.random.randn(400).cumsum() + 100,
})
trades = trades.sort_values(['symbol', 'date'])
# 7-day rolling mean PER symbol - independently for BTC and ETH
trades['ma_7'] = (
trades
.groupby('symbol')['price']
.transform(lambda x: x.rolling(7, min_periods=1).mean())
)
The .transform() method is crucial here - it returns a Series with the same index as the input, so you can assign it back as a column. .apply() on a GroupBy can return anything (a scalar, a Series of different length, a DataFrame), which makes alignment unpredictable. .transform() guarantees index alignment.
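The alignment difference is easiest to see on toy data: transform hands back one value per input row, while apply with a reducing function collapses each group:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# transform: same length and index as the input - safe to assign back
group_mean = df.groupby('group')['value'].transform('mean')
print(group_mean.tolist())  # [1.5, 1.5, 3.5, 3.5]

# apply with a reducer: one row per group - a different shape entirely
per_group = df.groupby('group')['value'].apply(lambda s: s.mean())
print(per_group.tolist())   # [1.5, 3.5]
```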
Custom Window Functions
When built-in aggregations aren't enough, .rolling().apply() lets you run arbitrary Python:
# Percentage of days in the window where price increased
def pct_up_days(window):
if len(window) < 2:
return np.nan
changes = np.diff(window)
return (changes > 0).sum() / len(changes) * 100
df['pct_up_7d'] = df['price'].rolling(7).apply(pct_up_days, raw=True)
The raw=True parameter passes the window as a NumPy array instead of a Pandas Series - this is significantly faster because it skips Series construction overhead. Always use raw=True unless your function needs index information.
MultiIndex - Hierarchical Indexing
When your data has natural hierarchies - exchange/pair, country/city, year/quarter/month - MultiIndex lets you work with them explicitly instead of flattening everything into separate columns:
trades = pd.DataFrame({
'exchange': ['Binance', 'Binance', 'Kraken', 'Kraken', 'Binance', 'Kraken'],
'pair': ['BTC/USDT', 'ETH/USDT', 'BTC/USDT', 'ETH/USDT', 'BTC/USDT', 'BTC/USDT'],
'volume': [100, 50, 80, 40, 110, 90],
'price': [50000, 3000, 50100, 3010, 50200, 50150],
})
multi = trades.set_index(['exchange', 'pair'])
print(multi)
volume price
exchange pair
Binance BTC/USDT 100 50000
ETH/USDT 50 3000
Kraken BTC/USDT 80 50100
ETH/USDT 40 3010
Binance BTC/USDT 110 50200
Kraken BTC/USDT 90 50150
Selecting Data from a MultiIndex
# All Binance rows
multi.loc['Binance']
# Specific exchange + pair
multi.loc[('Binance', 'BTC/USDT')]
# Cross-section: all BTC/USDT regardless of exchange
multi.xs('BTC/USDT', level='pair')
# Slice a range on the first level (slicing requires a lexsorted index)
multi.sort_index().loc['Binance':'Kraken']
The .xs() (cross-section) method is the clean way to select on a non-first level. Without it, you'd need to use multi.loc[pd.IndexSlice[:, 'BTC/USDT'], :] - which works but is harder to read.
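For completeness, here are both spellings side by side on a small sorted example (note the sort_index(): slicing a MultiIndex with IndexSlice requires a lexsorted index):

```python
import pandas as pd

df = pd.DataFrame({
    'exchange': ['Binance', 'Binance', 'Kraken', 'Kraken'],
    'pair': ['BTC/USDT', 'ETH/USDT', 'BTC/USDT', 'ETH/USDT'],
    'volume': [100, 50, 80, 40],
}).set_index(['exchange', 'pair']).sort_index()

idx = pd.IndexSlice
via_slice = df.loc[idx[:, 'BTC/USDT'], :]  # keeps the 'pair' level
via_xs = df.xs('BTC/USDT', level='pair')   # drops the 'pair' level

print(via_slice['volume'].tolist())  # [100, 80]
print(via_xs['volume'].tolist())     # [100, 80]
```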
Pivot Tables and Reshaping
Pivot tables create MultiIndex naturally:
pivot = trades.pivot_table(
values=['volume', 'price'],
index='exchange',
columns='pair',
aggfunc={'volume': 'sum', 'price': 'mean'}
)
# Access: (outer column, inner column)
pivot[('volume', 'BTC/USDT')]
Stacking and unstacking reshape between long and wide formats - operations that come up constantly when preparing data for analysis or visualization:
# Wide to long (move column levels to index)
long = pivot.stack()
# Long to wide (move index levels to columns)
wide = long.unstack()
# Flatten a MultiIndex column header
pivot.columns = ['_'.join(col).strip() for col in pivot.columns.values]
# Now: 'volume_BTC/USDT', 'volume_ETH/USDT', 'price_BTC/USDT', 'price_ETH/USDT'
The column-flattening trick is essential when you need to export data or use it with tools that don't understand MultiIndex (plotting libraries, CSV exports, database inserts).
Time Series - Resampling and Shifting
Pandas was built for time series. Once your index is a DatetimeIndex, you unlock temporal operations that would be painful to implement manually.
Resampling - Changing Frequency
# Daily to monthly (downsampling - aggregate)
monthly = df.resample('ME').agg(
open=('price', 'first'),
high=('price', 'max'),
low=('price', 'min'),
close=('price', 'last'),
total_volume=('volume', 'sum'),
avg_volume=('volume', 'mean'),
)
# Daily to weekly
weekly = df.resample('W-MON').mean()  # Weekly bins ending (and labeled) on Monday
# Monthly to daily (upsampling - interpolate)
daily_from_monthly = monthly['close'].resample('D').interpolate(method='linear')
The resampling frequency strings are worth memorizing: 'D' (day), 'W' (week), 'ME' (month-end), 'MS' (month-start), 'QE' (quarter-end), 'YE' (year-end), 'h' (hour), 'min' (minute). The ME/MS distinction matters - 'ME' labels each group with the last day of the month, 'MS' with the first.
Shifting - Temporal Comparisons
# Previous day's price
df['prev_close'] = df['price'].shift(1)
# Daily return (percentage change)
df['daily_return'] = df['price'].pct_change()
# 7-day return
df['weekly_return'] = df['price'].pct_change(7)
# Difference from 30 days ago (absolute, not percentage)
df['price_change_30d'] = df['price'].diff(30)
# Next day's price (negative shift = look forward)
df['next_close'] = df['price'].shift(-1)
Forward-looking shifts (shift(-1)) are used for creating target variables in predictive modeling - "given today's features, predict tomorrow's price." Be careful not to accidentally use future data in your features (this is called data leakage and it makes your models look great in backtesting and terrible in production).
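A minimal sketch of the target-versus-feature distinction (toy prices; the column names are my own):

```python
import pandas as pd

df = pd.DataFrame({'close': [100.0, 102.0, 101.0, 105.0]})

# Target: tomorrow's return - built with a forward shift.
df['target'] = df['close'].pct_change().shift(-1)

# Feature: yesterday's return - only uses data known at prediction time.
df['prev_return'] = df['close'].pct_change().shift(1)

# The final row has no target (tomorrow is unknown); drop it before
# training rather than filling it with information from the future.
train = df.dropna(subset=['target'])
print(len(train))  # 3
```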
Time Zone Handling
# Localize naive timestamps to UTC
df.index = df.index.tz_localize('UTC')
# Convert to another timezone
df.index = df.index.tz_convert('Europe/Amsterdam')
# Create timezone-aware range directly
idx = pd.date_range('2025-01-01', periods=100, freq='h', tz='US/Eastern')
Always store timestamps in UTC internally and convert to local time only for display. This avoids daylight saving time bugs (clocks jumping forward/backward create duplicate or missing hours), ambiguous timestamps during DST transitions, and confusion when combining data from different geographic sources.
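Here's a sketch of the failure mode, using tz_localize's nonexistent parameter to handle a spring-forward gap explicitly (the EU switches on the last Sunday of March):

```python
import pandas as pd

# 02:30 on 2025-03-30 does not exist in Amsterdam: clocks jump
# from 02:00 straight to 03:00 that night.
naive = pd.DatetimeIndex(['2025-03-30 02:30'])

# Plain tz_localize('Europe/Amsterdam') would raise here; explicit
# handling shifts into the first valid moment instead.
fixed = naive.tz_localize('Europe/Amsterdam', nonexistent='shift_forward')
print(fixed[0])  # 2025-03-30 03:00:00+02:00
```

If your pipeline stores UTC from the start, you never have to make this call at all.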
Filling Gaps in Time Series
Real-world time series often have missing dates (weekends for stock data, downtime for sensors). Reindex to fill the gaps:
# Create complete date range and reindex
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(full_range)
# Fill gaps with various strategies
df['price_ffill'] = df['price'].ffill() # Forward-fill: use last known value
df['price_interp'] = df['price'].interpolate() # Linear interpolation
df['volume_zero'] = df['volume'].fillna(0) # Fill with zero (no trades = zero volume)
The choice of fill strategy depends on what the missing data means. Forward-fill is appropriate for prices (the last known price is the best estimate). Interpolation works for smooth physical measurements (temperature, pressure). Zero-fill works for counts and volumes (no data = nothing happened).
The apply() Problem - And What to Use Instead
.apply() is Pandas' escape hatch - it runs a Python function on each row, column, or group. It's tempting to use it for everything, and that temptation should be resisted:
# SLOW: apply runs a Python loop internally (~100x slower)
df['category'] = df['price'].apply(lambda x: 'high' if x > 50000 else 'low')
# FAST: vectorized with np.where
df['category'] = np.where(df['price'] > 50000, 'high', 'low')
# SLOW: apply for string operations
df['upper'] = df['name'].apply(lambda x: x.upper())
# FAST: vectorized string accessor
df['upper'] = df['name'].str.upper()
# SLOW: apply for multiple conditions
def classify(row):
if row['price'] > 50000 and row['volume'] > 3000:
return 'hot'
elif row['price'] > 50000:
return 'expensive'
else:
return 'normal'
df['class'] = df.apply(classify, axis=1)
# FAST: np.select for multiple conditions
conditions = [
(df['price'] > 50000) & (df['volume'] > 3000),
df['price'] > 50000,
]
choices = ['hot', 'expensive']
df['class'] = np.select(conditions, choices, default='normal')
np.select is the vectorized replacement for multi-branch if/elif/else logic. It evaluates conditions in order and returns the first match, just like if/elif. For two conditions, np.where is enough; for three or more, np.select is the tool.
When apply() IS appropriate:
- Complex logic that genuinely can't be vectorized (e.g., calling an external API per row)
- GroupBy operations that need access to the full group as a DataFrame
- Prototyping - get it working first, then optimize
When to avoid it:
- Element-wise numeric or string operations (use vectorized ops)
- Conditional assignment (use np.where or np.select)
- Any operation where performance matters on >10K rows
The performance difference is real and grows with data size: on a million-row DataFrame, a vectorized operation might take 10ms while apply() takes 3 seconds. That's 300x slower for doing the exact same thing.
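Exact numbers depend on your machine, but the gap is easy to reproduce yourself - a rough benchmark sketch:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': np.random.randint(1, 100_000, 100_000)})

t_apply = timeit.timeit(
    lambda: df['price'].apply(lambda x: 'high' if x > 50_000 else 'low'),
    number=3,
)
t_vec = timeit.timeit(
    lambda: np.where(df['price'] > 50_000, 'high', 'low'),
    number=3,
)
print(f'apply: {t_apply:.3f}s  vectorized: {t_vec:.3f}s  ({t_apply / t_vec:.0f}x)')
```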
Advanced Merge Patterns
Episodes #30-33 covered pd.merge() for SQL-style joins. Here are the advanced patterns for real-world scenarios.
Merge with Indicator
Know which rows matched and which didn't:
result = pd.merge(users, orders, on='user_id', how='outer', indicator=True)
# _merge column values: 'left_only', 'right_only', 'both'
# Find users who never placed an order
inactive = result.query('_merge == "left_only"')
# Find orphaned orders (no matching user)
orphaned = result.query('_merge == "right_only"')
This is invaluable for data quality checks - quickly finding records that should have matched but didn't.
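Made concrete with toy users/orders tables (the names and values are mine):

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['ann', 'bob', 'cem']})
orders = pd.DataFrame({'user_id': [1, 1, 4], 'amount': [10, 20, 30]})

result = pd.merge(users, orders, on='user_id', how='outer', indicator=True)

inactive = result.query('_merge == "left_only"')   # users who never ordered
orphaned = result.query('_merge == "right_only"')  # orders with no known user

print(inactive['name'].tolist())    # ['bob', 'cem']
print(orphaned['amount'].tolist())  # [30.0] - the NaNs forced the int column to float
```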
merge_asof - Inexact Time-Based Joins
This is the killer feature for time series work. When two datasets have timestamps that don't align exactly (which is always the case in practice):
# Trades happen at specific moments
trades = pd.DataFrame({
'time': pd.to_datetime(['2025-01-01 10:01:03', '2025-01-01 10:02:15', '2025-01-01 10:03:45']),
'symbol': ['BTC', 'ETH', 'BTC'],
'size': [1.0, 5.0, 2.0],
})
# Quotes update at different moments
quotes = pd.DataFrame({
'time': pd.to_datetime(['2025-01-01 10:01:00', '2025-01-01 10:02:00', '2025-01-01 10:03:00', '2025-01-01 10:03:30']),
'symbol': ['BTC', 'ETH', 'BTC', 'BTC'],
'bid': [49990, 2990, 50090, 50120],
'ask': [50010, 3010, 50110, 50140],
})
# Each trade gets the MOST RECENT quote for its symbol
merged = pd.merge_asof(
trades.sort_values('time'),
quotes.sort_values('time'),
on='time',
by='symbol', # Match within same symbol
direction='backward' # Look backward in time (default)
)
merge_asof finds the nearest match in time instead of requiring exact equality. The direction parameter controls whether to look backward (most recent), forward (next upcoming), or nearest. This is essential when combining trade data with quote data, sensor readings at different frequencies, or any two time series that weren't sampled simultaneously.
You can also set a tolerance to avoid matching records too far apart:
# Only match if quote is within 5 seconds of trade
merged = pd.merge_asof(
trades.sort_values('time'),
quotes.sort_values('time'),
on='time',
by='symbol',
tolerance=pd.Timedelta('5s') # No match beyond 5 seconds
)
Practical Example: Building a Complete Analysis Pipeline
Let's tie everything together in a pipeline that uses multiple techniques:
import pandas as pd
import numpy as np
def load_ohlcv(filepath):
return (
pd.read_parquet(filepath, dtype_backend='pyarrow')
.set_index('date')
.sort_index()
)
def add_technical_features(df):
return df.assign(
ma_7=lambda d: d['close'].rolling(7, min_periods=1).mean(),
ma_30=lambda d: d['close'].rolling(30, min_periods=1).mean(),
volatility=lambda d: d['close'].rolling(30).std(),
daily_return=lambda d: d['close'].pct_change(),
volume_ma=lambda d: d['volume'].rolling(7, min_periods=1).mean(),
volume_ratio=lambda d: d['volume'] / d['volume'].rolling(7).mean(),
trend=lambda d: np.where(d['close'] > d['close'].rolling(30).mean(), 'up', 'down'),
)
def monthly_report(df):
return (
df.resample('ME').agg(
open=('open', 'first'),
high=('high', 'max'),
low=('low', 'min'),
close=('close', 'last'),
total_volume=('volume', 'sum'),
avg_volatility=('volatility', 'mean'),
pct_up_days=('daily_return', lambda x: (x > 0).mean() * 100),
)
.assign(monthly_return=lambda d: d['close'].pct_change() * 100)
.round(2)
)
# The pipeline: load → enrich → summarize → filter
report = (
load_ohlcv('btc_daily.parquet')
.pipe(add_technical_features)
.pipe(monthly_report)
.query('monthly_return > 0') # Only positive months
)
Each function is independently testable, reusable across different assets, and the pipeline reads like a description of what it does. This is the payoff of combining method chaining (episode #54) with the window functions, resampling, and advanced patterns from this episode.
What to remember from this one
In this episode, we went deep on advanced Pandas techniques:
- Rolling windows with min_periods and raw=True for performance - and exponential moving averages for recency-weighted calculations
- Grouped rolling via .transform() applies window functions independently per group while maintaining index alignment
- MultiIndex for hierarchical data with .xs() for cross-sectional selection and column flattening for export
- Pivot tables, stacking, and unstacking reshape data between long and wide formats
- Resampling changes frequency (daily to monthly, hourly to daily) with proper aggregation
- Shifting enables temporal comparisons and feature engineering - but watch out for data leakage with forward shifts
- Time zones: store as UTC, convert for display only
- np.where and np.select are the vectorized replacements for .apply() - 100-300x faster on large datasets
- merge_asof with tolerance aligns time series at different frequencies - the killer feature for real-world temporal joins
- Merge indicators track which records matched, essential for data quality checks
These techniques bridge the gap between "loading a CSV" and "doing actual data science." Most of your data work will use some combination of window functions, resampling, and advanced merges. Master these, and the analysis itself becomes the easy part.