The Feature Engineering Philosophy
Feature engineering for financial ML is different from other domains because every feature is a hypothesis about market microstructure. "200-day moving average" is not just a number — it's a claim that whether price is above or below its 200-day average contains information about future returns. That hypothesis might be true, partially true, regime-dependent, or wrong.
The goal is not to include every possible technical indicator. The goal is to include features that:
- Have an economic rationale (why would this predict anything?)
- Are computable without future information
- Are not perfectly collinear with each other (redundant features waste capacity)
- Cover different "dimensions" of market behavior: trend, momentum, volatility, volume, sentiment
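The collinearity criterion above can be checked mechanically. The following is a minimal sketch, not the series' own pipeline: the helper name `drop_collinear_features`, the 0.95 threshold, and the synthetic columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_collinear_features(features: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return the column names to keep after a greedy pairwise-correlation filter.

    Walks the columns left to right and keeps a column only if its absolute
    Pearson correlation with every already-kept column is below `threshold`.
    A column whose correlation is NaN (e.g. a constant column) is dropped.
    """
    corr = features.corr().abs()
    kept = []
    for col in features.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept


# Synthetic demo: 'b' is a rescaled copy of 'a' plus tiny noise, 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + 0.001 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
kept_columns = drop_collinear_features(demo, threshold=0.95)
```

The filter is order-dependent: the first column of a correlated pair survives, so put the features with the clearest economic rationale first.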
Category 1: Trend Features (Moving Averages)
Moving averages are among the most commonly used features in technical analysis. They smooth out daily noise and expose underlying directional bias.
Simple Moving Average (SMA)
SMA(n) = (1/n) · Σ closeᵢ, summed over the most recent n trading days
The arithmetic mean of the last n closing prices. Slow to react to new information. Used to identify medium and long-term trend direction.
def add_moving_averages(df, windows=(7, 14, 50, 200)):
    """Compute SMAs per ticker, grouped to prevent ticker mixing."""
    for w in windows:
        # Grouped operation: each ticker gets its own independent rolling window
        df[f'sma_{w}'] = df.groupby('ticker')['close'].transform(
            lambda x: x.rolling(w, min_periods=w).mean()
        )
        # Price relative to SMA: positive = above the average (bullish),
        # negative = below (bearish); e.g. +0.05 means 5% above the SMA
        df[f'price_vs_sma_{w}'] = (df['close'] / df[f'sma_{w}']) - 1.0
    return df
Why multiple windows?
Different timeframes capture different market participants' behavior:
| Window | Captures | Typical users |
|---|---|---|
| 7-day SMA | Short-term momentum, mean-reversion signals | Day traders, short-term funds |
| 14-day SMA | 2-week trend, momentum persistence | Swing traders |
| 50-day SMA | Medium-term trend, growth vs. value rotation | Growth funds, sector rotation |
| 200-day SMA | Long-term trend, bull/bear market classification | Institutional allocators, quant funds |
Rather than using the raw SMA values (which have price-level scaling issues), we use the price relative to SMA — a ratio that tells us how far the current price is from its average. This is already approximately normalized and scale-invariant across tickers.
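To make the scale-invariance point concrete, here is a small sketch with two hypothetical price series that differ only in price level. The helper `price_vs_sma` and the toy prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def price_vs_sma(close: pd.Series, window: int) -> pd.Series:
    """Distance of the close from its own simple moving average, as a ratio."""
    sma = close.rolling(window, min_periods=window).mean()
    return close / sma - 1.0


# Two hypothetical tickers with the same percentage moves but different price levels
cheap = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])
expensive = cheap * 50

cheap_ratio = price_vs_sma(cheap, window=3)
expensive_ratio = price_vs_sma(expensive, window=3)

# The raw SMAs differ by a factor of 50, the ratios do not differ at all
same_ratio = np.allclose(cheap_ratio.dropna(), expensive_ratio.dropna())
```

The first `window - 1` rows are NaN because `min_periods=window` refuses to average over an incomplete window.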
Don't include sma_200 and price_vs_sma_200 as separate features: they carry nearly identical information. Use the ratio version. Similarly, including sma_7, sma_14, and sma_50 simultaneously is highly redundant. Consider computing an sma_7 / sma_50 crossover ratio instead.
Category 2: Momentum Features (RSI)
RSI = 100 − [100 / (1 + RS)]
RS = Average Gain over N periods / Average Loss over N periods
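A quick worked example of the formula, as a sketch: the function name `rsi_from_averages` and the handling of a zero average loss are illustrative choices, not part of the original code.

```python
def rsi_from_averages(avg_gain: float, avg_loss: float) -> float:
    """Apply RSI = 100 - 100 / (1 + RS) with RS = avg_gain / avg_loss."""
    if avg_loss == 0:
        # No losses in the window: RS is unbounded, so RSI tends to 100.
        # A completely flat window (no gains either) is left undefined here.
        return 100.0 if avg_gain > 0 else float("nan")
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


balanced = rsi_from_averages(1.0, 1.0)  # RS = 1   -> RSI = 50
strong = rsi_from_averages(1.0, 0.5)    # RS = 2   -> RSI = 66.67 (approx.)
weak = rsi_from_averages(0.5, 1.0)      # RS = 0.5 -> RSI = 33.33 (approx.)
```

RSI is bounded between 0 and 100, so unlike raw prices it needs no further scaling across tickers.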
import pandas as pd

def compute_rsi(series: pd.Series, window: int = 14) -> pd.Series:
    """RSI computation using Wilder's smoothing method."""
    delta = series.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    # Wilder smoothing (exponential, alpha = 1/window)
    avg_gain = gain.ewm(alpha=1/window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1/window, min_periods=window, adjust=False).mean()
    rs = avg_gain / (avg_loss + 1e-10)  # avoid division by zero
    return 100 - (100 / (1 + rs))

def add_rsi_features(df, windows=(7, 14)):
    for w in windows:
        # Grouped so that each ticker's RSI only sees its own price history
        df[f'rsi_{w}'] = df.groupby('ticker')['close'].transform(
            lambda x: compute_rsi(x, window=w)
        )
    return df
RSI as a feature — what the model might learn
RSI signals mean-reversion tendencies (extreme values tend to snap back) and momentum continuation (moderate RSI in a trend tends to persist). Whether your specific dataset and prediction horizon support either pattern is an empirical question — RSI is a hypothesis, not a guarantee.
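One way to treat this as an empirical question is to measure the rank correlation between the feature and the forward return (often called a rank information coefficient). The sketch below is an assumption-laden illustration: the function name `rank_ic`, the 5-day default horizon, and the toy prices are hypothetical, and the forward return is used only for evaluation, never as a model input.

```python
import numpy as np
import pandas as pd


def rank_ic(df: pd.DataFrame, feature: str, horizon: int = 5) -> float:
    """Spearman rank correlation between `feature` and the forward log return.

    Assumes `df` exposes 'ticker' (column or index level) and a 'close'
    column, sorted by date within each ticker.
    """
    fwd_ret = df.groupby("ticker")["close"].transform(
        lambda x: np.log(x.shift(-horizon) / x)
    )
    valid = df[feature].notna() & fwd_ret.notna()
    # Spearman correlation = Pearson correlation of the ranks
    return float(df.loc[valid, feature].rank().corr(fwd_ret[valid].rank()))


# Sanity check only: a deliberately leaky "feature" equal to the forward
# return itself must score 1.0. Never feed such a column to a model.
demo = pd.DataFrame({
    "ticker": ["AAA"] * 6,
    "close": [100.0, 101.0, 103.0, 102.0, 106.0, 107.0],
})
demo["leaky"] = np.log(demo["close"].shift(-1) / demo["close"])
sanity_ic = rank_ic(demo, "leaky", horizon=1)
```

Running `rank_ic(df, "rsi_14")` on real data gives a single number per horizon; a value that stays near zero across time periods suggests the hypothesis does not hold for that dataset.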
Category 3: Volume and Liquidity Features
Volume provides context that price alone cannot. A +2% move on 10x average volume means something very different from a +2% move on 0.3x average volume.
import numpy as np

def add_volume_features(df, windows=(5, 20)):
    # Volume relative to its own recent average (scale-invariant)
    for w in windows:
        df[f'vol_ratio_{w}d'] = df.groupby('ticker')['volume'].transform(
            lambda x: x / x.rolling(w, min_periods=w).mean()
        )
    # On-Balance Volume (OBV): cumulative signed volume
    def compute_obv(group):
        # The first row has no previous close, so its direction is set to 0
        direction = np.sign(group['close'].diff()).fillna(0)
        return (direction * group['volume']).cumsum()
    df['obv'] = df.groupby('ticker', group_keys=False).apply(compute_obv)
    # OBV trend (least-squares slope over 10 days) as feature
    df['obv_trend_10d'] = df.groupby('ticker')['obv'].transform(
        lambda x: x.rolling(10).apply(
            lambda v: np.polyfit(np.arange(len(v)), v, 1)[0], raw=True
        )
    )
    return df
Category 4: Volatility Features
Volatility is arguably the most important feature for options-adjacent strategies and for regime detection. Multiple volatility measures capture different aspects.
Historical volatility (realized vol)
import numpy as np

def add_volatility_features(df, windows=(5, 10, 20, 60)):
    # Log returns (needed for vol computation)
    df['log_ret'] = df.groupby('ticker')['close'].transform(
        lambda x: np.log(x / x.shift(1))
    )
    for w in windows:
        # Annualized historical volatility
        df[f'hvol_{w}d'] = df.groupby('ticker')['log_ret'].transform(
            lambda x: x.rolling(w).std() * np.sqrt(252)
        )
    # Volatility ratio: short-term vs. long-term (vol regime indicator)
    df['vol_ratio'] = df['hvol_5d'] / (df['hvol_60d'] + 1e-10)
    # Intraday range as proportion of close (daily ATR proxy)
    df['daily_range'] = (df['high'] - df['low']) / df['close']
    df['range_vs_20d_avg'] = df.groupby('ticker')['daily_range'].transform(
        lambda x: x / x.rolling(20).mean()
    )
    return df
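The `np.sqrt(252)` factor in the code above converts a daily standard deviation into an annualized one. A small worked example follows; the 1% daily figure is hypothetical, and the square-root-of-time rule itself assumes returns are roughly independent from day to day.

```python
import math

TRADING_DAYS_PER_YEAR = 252  # same convention as the sqrt(252) factor above


def annualize_vol(daily_vol: float) -> float:
    """Scale a daily standard deviation of log returns to an annual figure."""
    return daily_vol * math.sqrt(TRADING_DAYS_PER_YEAR)


# A stock whose daily log returns have a 1% standard deviation
annual_vol = annualize_vol(0.01)  # roughly 0.159, i.e. about 16% per year
```

Because every window is scaled by the same constant, the annualization cancels out in `vol_ratio`; it only matters when comparing against quoted volatilities such as implied vol.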
IV/HV ratio (from options data)
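The IV/HV ratio divides option-implied volatility by realized (historical) volatility. Values above 1 mean the options market is pricing in more movement than the stock has recently delivered. From our observations in this project, the iv_hv_ratio was one of the highest-weighted features in the TFT's gate: the model learned it was contextually very important.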
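The text does not show how `iv_hv_ratio` is computed, so the following is only a sketch under stated assumptions: that an annualized 30-day implied volatility column `iv_30d` is available from options data, and that the 20-day realized volatility `hvol_20d` is the denominator. The function name and the choice of window are hypothetical.

```python
import numpy as np
import pandas as pd


def add_iv_hv_ratio(df: pd.DataFrame, iv_col: str = "iv_30d",
                    hv_col: str = "hvol_20d") -> pd.DataFrame:
    """Add implied-to-realized volatility ratio; both inputs must be annualized."""
    out = df.copy()
    # A zero realized vol (e.g. a halted stock) would give an infinite ratio,
    # so it is mapped to NaN instead.
    out["iv_hv_ratio"] = out[iv_col] / out[hv_col].replace(0.0, np.nan)
    return out


# Hypothetical values: implied vol 30% against realized vol 20% gives 1.5
demo = pd.DataFrame({"iv_30d": [0.30, 0.25, 0.40], "hvol_20d": [0.20, 0.25, 0.0]})
result = add_iv_hv_ratio(demo)
```

Both columns must be in the same units (annualized, same day-count convention), otherwise the ratio is off by a constant factor.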
Category 5: Asset-Type Embeddings
When a model processes multiple asset types — stocks, commodities, bonds — it benefits from knowing the type of asset it's looking at, not just the numeric values. This is done via an embedding layer.
A concrete example
Suppose you have three asset types: equity (0), commodity (1), bond (2). You choose embedding dimension = 4. The embedding table might learn:
                 dim0    dim1    dim2    dim3
equity    (0): [ 0.12,  -0.08,   0.41,   0.03]
commodity (1): [ 0.88,   0.31,  -0.15,   0.72]
bond      (2): [-0.45,   0.62,   0.28,  -0.19]
These numbers are not set manually — they emerge from training. The model uses these 4 extra numbers as context for any given row, helping it apply different patterns to equities vs. commodities.
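Mechanically, an embedding layer is a lookup table indexed by the integer ID. The sketch below reproduces that lookup in plain NumPy using the example table above. The numeric feature values are hypothetical, and concatenation is just one common way to hand the vector to the model, not necessarily what this series does.

```python
import numpy as np

# The example table from the text: one row per asset type, 4 dimensions each
embedding_table = np.array([
    [0.12, -0.08, 0.41, 0.03],    # equity (0)
    [0.88, 0.31, -0.15, 0.72],    # commodity (1)
    [-0.45, 0.62, 0.28, -0.19],   # bond (2)
])

# Four rows of data: equity, bond, equity, commodity
asset_type_ids = np.array([0, 2, 0, 1])

# Hypothetical numeric features for the same four rows (e.g. a return and an RSI)
numeric_features = np.array([
    [0.01, 55.0],
    [0.00, 48.0],
    [-0.02, 30.0],
    [0.03, 71.0],
])

# The lookup is plain integer indexing: row i receives the vector for its asset type
looked_up = embedding_table[asset_type_ids]                           # shape (4, 4)
model_input = np.concatenate([numeric_features, looked_up], axis=1)   # shape (4, 6)
```

In a trained network the table entries are parameters updated by backpropagation; here they are fixed numbers purely to show the indexing.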
import torch
import torch.nn as nn

class AssetTypeEmbedding(nn.Module):
    def __init__(self, num_asset_types: int, embed_dim: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(num_asset_types, embed_dim)

    def forward(self, asset_type_ids: torch.Tensor) -> torch.Tensor:
        # asset_type_ids: (batch_size, seq_len) integer IDs
        # Output: (batch_size, seq_len, embed_dim)
        return self.embedding(asset_type_ids)

# In your data preparation:
ASSET_TYPE_MAP = {'equity': 0, 'commodity': 1, 'bond': 2, 'crypto': 3}
df['asset_type_id'] = df['asset_class'].map(ASSET_TYPE_MAP)
Ticker-level embeddings
You can take this further by embedding individual tickers (not just asset types). This allows the model to learn "NVDA behaves differently from COST" at a fundamental level, rather than just inferring it from the features.
class TickerEmbedding(nn.Module):
    def __init__(self, num_tickers: int, embed_dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(num_tickers, embed_dim)

    def forward(self, ticker_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding(ticker_ids)

# TICKER_MAP: {'AAPL': 0, 'MSFT': 1, ...}
# IMPORTANT: this map must be saved and loaded consistently
# between training runs; ticker IDs must be stable
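One simple way to keep ticker IDs stable is to persist the map to disk and only ever append new tickers. This is a sketch under assumptions: the function name `load_or_extend_ticker_map`, the JSON format, and the file name are illustrative, not the project's actual storage scheme.

```python
import json
import tempfile
from pathlib import Path


def load_or_extend_ticker_map(path, tickers) -> dict:
    """Load a saved ticker-to-ID map, append unseen tickers, and save it back.

    Existing IDs are never changed. New tickers receive the next free ID,
    which assumes the stored IDs are contiguous from 0 (true if this
    function is the only writer).
    """
    path = Path(path)
    ticker_map = json.loads(path.read_text()) if path.exists() else {}
    for ticker in sorted(set(tickers)):
        if ticker not in ticker_map:
            ticker_map[ticker] = len(ticker_map)
    path.write_text(json.dumps(ticker_map, indent=2, sort_keys=True))
    return ticker_map


# Demo in a temporary directory: the second run sees a new ticker (NVDA)
with tempfile.TemporaryDirectory() as tmp:
    map_path = Path(tmp) / "ticker_map.json"
    first_run = load_or_extend_ticker_map(map_path, ["MSFT", "AAPL"])
    second_run = load_or_extend_ticker_map(map_path, ["NVDA", "AAPL"])
```

Appending a ticker also means the embedding table needs one more row, so a checkpoint trained on the smaller map cannot be loaded unchanged.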
Category 6: Seasonality Features
Financial markets have well-documented seasonal patterns: January effect, tax-loss harvesting in December, earnings seasons in January/April/July/October, low-volume summer lulls, end-of-quarter rebalancing. Encoding these explicitly gives the model a shortcut to patterns it would otherwise need many years of data to learn.
import numpy as np
import pandas as pd

def add_seasonality_features(df):
    dates = pd.DatetimeIndex(df.index.get_level_values('date'))
    # Month of year (1-12), learnable via embedding or sin/cos encoding
    df['month'] = dates.month
    # Day of week (0-4, Monday-Friday)
    df['dow'] = dates.dayofweek
    # Quarter (1-4), for earnings season alignment
    df['quarter'] = dates.quarter
    # Earnings season flag (approx. months 1, 4, 7, 10)
    df['earnings_season'] = df['month'].isin([1, 4, 7, 10]).astype(int)
    # End-of-month flag for fund flows: calendar day >= 26, a rough proxy
    # for the last few trading days of the month
    df['month_end'] = (dates.day >= 26).astype(int)
    # Cyclical encoding for month (preserves Dec→Jan continuity)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    return df
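The `dates.day >= 26` rule is a calendar approximation: it flags between three and six calendar days depending on the month, and some of them may not be trading days at all. A flag based on actual trading days could be sketched as follows, assuming the same `(date, ticker)` MultiIndex as above. The function name is hypothetical, and the rows present for each ticker are taken as its trading calendar.

```python
import numpy as np
import pandas as pd


def add_trading_month_end_flag(df: pd.DataFrame, n_days: int = 3) -> pd.DataFrame:
    """Flag the last `n_days` trading days of each month, separately per ticker."""
    out = df.copy()
    dates = pd.DatetimeIndex(out.index.get_level_values("date"))
    tickers = np.asarray(out.index.get_level_values("ticker"))
    month_key = (dates.year * 100 + dates.month).to_numpy()  # e.g. 202401
    day_of_month = pd.Series(dates.day.to_numpy(), index=out.index)
    # Within each (ticker, month) group, rank days from the end: latest day = 1
    rank_from_end = day_of_month.groupby([tickers, month_key]).rank(
        method="dense", ascending=False
    )
    out["month_end"] = (rank_from_end <= n_days).astype(int)
    return out


# Demo: two complete months of business days for one hypothetical ticker
days = pd.bdate_range("2024-01-01", "2024-02-29")
index = pd.MultiIndex.from_product([days, ["AAA"]], names=["date", "ticker"])
demo = pd.DataFrame({"close": [100.0] * len(days)}, index=index)
flagged = add_trading_month_end_flag(demo, n_days=3)
```

One caveat: for a month that is still in progress, the last rows in the data are not the true month end, so in live use the flag should come from an exchange calendar known in advance rather than from the data itself.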
Category 7: Cross-Ticker / Sector Features
Individual stocks don't move in isolation. NVDA often drags AMD. When tech is selling off broadly, individual names face headwinds regardless of their own fundamentals. Encoding sector context helps the model understand the broader environment.
def add_sector_context_features(df, sector_map):
    # Requires 'log_ret' from add_volatility_features
    df['sector'] = df.index.get_level_values('ticker').map(sector_map)
    # Sector return: average return of all tickers in the same sector, same day
    # This is a PEER feature: it uses only same-day data, so it carries no
    # future information as long as the prediction target starts after this date
    sector_daily_ret = (
        df.groupby(['date', 'sector'])['log_ret']
        .mean()
        .rename('sector_avg_ret')
    )
    df = df.join(sector_daily_ret, on=['date', 'sector'])
    # Ticker return relative to its sector average
    df['alpha_vs_sector'] = df['log_ret'] - df['sector_avg_ret']
    # Sector volatility: cross-sectional dispersion of returns on that day
    sector_vol = (
        df.groupby(['date', 'sector'])['log_ret']
        .std()
        .rename('sector_vol')
    )
    df = df.join(sector_vol, on=['date', 'sector'])
    return df
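Note that the sector average above includes the ticker's own return, which matters in small sectors where one name can dominate its own benchmark. A leave-one-out variant, which averages only the other tickers in the sector, could be sketched like this. It is an alternative under the same `(date, ticker)` index assumption, not the method used in the code above, and `add_peer_avg_ret` is a hypothetical name.

```python
import pandas as pd


def add_peer_avg_ret(df: pd.DataFrame) -> pd.DataFrame:
    """Add the same-day average return of a ticker's sector peers, excluding itself."""
    out = df.copy()
    grouped = out.groupby(["date", "sector"])["log_ret"]
    total = grouped.transform("sum")
    count = grouped.transform("count")
    # Subtract the row's own return, then average over the remaining peers.
    # A sector with a single ticker has no peers, so the result is NaN.
    out["peer_avg_ret"] = (total - out["log_ret"]) / (count - 1).where(count > 1)
    return out


# Demo: three hypothetical semiconductor names and one energy name on one day
index = pd.MultiIndex.from_tuples(
    [("2024-01-02", "NVDA"), ("2024-01-02", "AMD"),
     ("2024-01-02", "INTC"), ("2024-01-02", "XOM")],
    names=["date", "ticker"],
)
demo = pd.DataFrame(
    {"sector": ["semis", "semis", "semis", "energy"],
     "log_ret": [0.03, 0.01, -0.01, 0.02]},
    index=index,
)
peers = add_peer_avg_ret(demo)
```

For NVDA the peer average is the mean of AMD and INTC only, so NVDA's own move no longer leaks into the benchmark it is compared against.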
Regularization: L1, L2, and Elastic Net
With a rich feature set, models can overfit — memorizing training-period patterns that don't generalize. Regularization adds a penalty term to the loss function that discourages extreme weight values.
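L1 Regularization (Lasso)
Adds the sum of absolute weight values to the loss. Pushes the weights of uninformative features to exactly zero, which acts as built-in feature selection.
Loss_L1 = MSE + λ · Σ|wᵢ|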
L2 Regularization (Ridge)
Adds the sum of squared weight values to the loss. Keeps all weights small but non-zero. Good for preventing any single feature from dominating. Standard for neural networks (called "weight decay").
Loss_L2 = MSE + λ · Σwᵢ²
Elastic Net
Combines L1 and L2: sparsity from L1, stability from L2. A sensible default for high-dimensional financial feature sets with many correlated indicators.
Loss_EN = MSE + λ₁ · Σ|wᵢ| + λ₂ · Σwᵢ²
import torch.nn.functional as F

# In PyTorch: L2 is built into the optimizer as weight_decay
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4  # This IS L2 regularization
)

# L1 is added manually to the loss:
def compute_loss_with_l1(predictions, targets, model, l1_lambda=1e-5):
    mse_loss = F.mse_loss(predictions, targets)
    # Sum of absolute values over every parameter tensor (weights and biases)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return mse_loss + l1_lambda * l1_penalty
These are hyperparameters — settings you choose before training. They're not per-feature, they're per-model. Typical starting values: weight_decay ≈ 1e-4, l1_lambda ≈ 1e-5. You adjust them based on whether the model overfits.
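To see how the two penalties combine in an elastic net, here is a small worked example in plain NumPy rather than PyTorch. The function name `elastic_net_loss` and all numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np


def elastic_net_loss(predictions, targets, weights,
                     l1_lambda: float = 1e-5, l2_lambda: float = 1e-4) -> float:
    """MSE plus an L1 penalty (sum of |w|) and an L2 penalty (sum of w^2)."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mse = np.mean((predictions - targets) ** 2)
    l1_penalty = np.sum(np.abs(weights))
    l2_penalty = np.sum(weights ** 2)
    return float(mse + l1_lambda * l1_penalty + l2_lambda * l2_penalty)


# MSE = (0^2 + 2^2) / 2 = 2.0, sum|w| = 7, sum w^2 = 25
# Loss = 2.0 + 0.1 * 7 + 0.01 * 25 = 2.95
loss = elastic_net_loss(
    predictions=[1.0, 2.0],
    targets=[1.0, 4.0],
    weights=[3.0, -4.0],
    l1_lambda=0.1,
    l2_lambda=0.01,
)
```

In the PyTorch setup shown earlier, the same combination arises from using `weight_decay` in the optimizer together with the manual L1 term in the loss.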
The Full Feature Set Summary
After all the above, our feature matrix for each (date, ticker) row looks like:
| Category | Features | Count |
|---|---|---|
| Trend | price_vs_sma_7, price_vs_sma_14, price_vs_sma_50, price_vs_sma_200 | 4 |
| Momentum | rsi_7, rsi_14 | 2 |
| Volume | vol_ratio_5d, vol_ratio_20d, obv_trend_10d | 3 |
| Volatility | hvol_5d, hvol_20d, hvol_60d, vol_ratio, range_vs_20d_avg | 5 |
| Options | iv_30d, iv_hv_ratio, put_call_ratio, skew_25d, term_slope | 5 |
| Sentiment | sentiment_score, sentiment_7d_ma, has_sentiment | 3 |
| Sector context | sector_avg_ret, alpha_vs_sector, sector_vol | 3 |
| Seasonality | month_sin, month_cos, quarter, earnings_season, month_end | 5 |
| Embeddings (IDs) | ticker_id, asset_type_id | 2 (→ vectors) |
| Total | | ~30 features + embeddings |
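Before training, it can help to verify that every column in the summary table is actually present in the frame. The following is a minimal sketch: the `FEATURE_GROUPS` dictionary and the `missing_features` helper are hypothetical, and the volatility range column uses the name produced by `add_volatility_features` (`range_vs_20d_avg`).

```python
# Feature names per category, mirroring the summary table
FEATURE_GROUPS = {
    "trend": ["price_vs_sma_7", "price_vs_sma_14", "price_vs_sma_50", "price_vs_sma_200"],
    "momentum": ["rsi_7", "rsi_14"],
    "volume": ["vol_ratio_5d", "vol_ratio_20d", "obv_trend_10d"],
    "volatility": ["hvol_5d", "hvol_20d", "hvol_60d", "vol_ratio", "range_vs_20d_avg"],
    "options": ["iv_30d", "iv_hv_ratio", "put_call_ratio", "skew_25d", "term_slope"],
    "sentiment": ["sentiment_score", "sentiment_7d_ma", "has_sentiment"],
    "sector": ["sector_avg_ret", "alpha_vs_sector", "sector_vol"],
    "seasonality": ["month_sin", "month_cos", "quarter", "earnings_season", "month_end"],
}
ID_COLUMNS = ["ticker_id", "asset_type_id"]  # consumed by the embedding layers

ALL_FEATURES = [name for group in FEATURE_GROUPS.values() for name in group]


def missing_features(available_columns) -> list:
    """Return the expected feature names that are absent from `available_columns`."""
    available = set(available_columns)
    return [name for name in ALL_FEATURES + ID_COLUMNS if name not in available]


# Demo: pretend the options feed failed to deliver implied volatility
available = [c for c in ALL_FEATURES + ID_COLUMNS if c != "iv_30d"]
missing = missing_features(available)
```

With a real frame, `missing_features(df.columns)` should return an empty list before the matrix is handed to the model.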
In Part 4, we'll see what happens when we feed this feature matrix into a Standard Transformer — and why it fails in a specific, diagnosable way.
Series — Building a Quantitative Trading System