Feature Engineering: From OHLCV to a Rich Feature Matrix

The Feature Engineering Philosophy

Feature engineering for financial ML differs from other domains because every feature is a hypothesis about how markets behave. A "200-day moving average" is not just a number; it is a claim that whether price sits above or below its 200-day average contains information about future returns. That hypothesis might be true, partially true, regime-dependent, or wrong.

The goal is not to include every possible technical indicator. The goal is to include features that:

Have an economic rationale (why would this predict anything?)
Are computable without future information
Are not perfectly collinear with each other (redundant features waste capacity)
Cover different "dimensions" of market behavior: trend, momentum, volatility, volume, sentiment
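
The collinearity criterion can be checked mechanically before training. Here is a minimal sketch, assuming the features sit in a pandas DataFrame with one column per feature. The helper name `find_collinear_pairs` and the 0.95 threshold are illustrative choices, not part of this project's pipeline.

```python
import numpy as np
import pandas as pd


def find_collinear_pairs(features: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, corr) for pairs whose |Pearson corr| exceeds threshold.

    Hypothetical helper: the name and default threshold are assumptions.
    """
    corr = features.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = corr.loc[a, b]
            if value > threshold:
                pairs.append((a, b, float(value)))
    return pairs


# Synthetic demo: 'b' is a rescaled copy of 'a'; 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": 2.0 * a + 1.0, "c": rng.normal(size=500)})
pairs = find_collinear_pairs(demo)  # only the (a, b) pair is flagged
```

Pearson correlation only catches linear redundancy, so treat a clean result as a necessary check rather than proof that the features are independent.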

Category 1: Trend Features (Moving Averages)

Moving averages are among the most commonly used features in technical analysis. They smooth out daily noise and expose underlying directional bias.

Simple Moving Average (SMA)

SMA(n)

The arithmetic mean of the last n closing prices. Slow to react to new information. Used to identify medium and long-term trend direction.

def add_moving_averages(df, windows=(7, 14, 50, 200)):
    """Compute SMAs per ticker (grouped to prevent ticker mixing)."""
    for w in windows:
        # Grouped operation: each ticker gets its own independent rolling window
        df[f'sma_{w}'] = df.groupby('ticker')['close'].transform(
            lambda x: x.rolling(w, min_periods=w).mean()
        )
        # Price relative to SMA: positive = above (bullish), negative = below (bearish)
        df[f'price_vs_sma_{w}'] = (df['close'] / df[f'sma_{w}']) - 1.0
    return df

Why multiple windows?

Different timeframes capture different market participants' behavior:

Window	Captures	Typical users
7-day SMA	Short-term momentum, mean-reversion signals	Day traders, short-term funds
14-day SMA	2-week trend, momentum persistence	Swing traders
50-day SMA	Medium-term trend, growth vs. value rotation	Growth funds, sector rotation
200-day SMA	Long-term trend, bull/bear market classification	Institutional allocators, quant funds

Rather than using the raw SMA values (which have price-level scaling issues), we use the price relative to SMA — a ratio that tells us how far the current price is from its average. This is already approximately normalized and scale-invariant across tickers.

Feature selection insight

Don't include both sma_200 and price_vs_sma_200 as separate features — they carry nearly identical information. Use the ratio version. Similarly, including sma_7, sma_14, and sma_50 simultaneously is high-redundancy. Consider computing sma_7 / sma_50 crossover ratio instead.
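
The crossover ratio suggested above could look like the sketch below. It assumes the same layout as the rest of this post (a `(date, ticker)` MultiIndex and a `close` column). The function name `add_sma_crossover` and the output column name are hypothetical.

```python
import pandas as pd


def add_sma_crossover(df: pd.DataFrame, fast: int = 7, slow: int = 50) -> pd.DataFrame:
    """Add sma_fast / sma_slow - 1 per ticker (positive = fast average above slow)."""
    close_by_ticker = df.groupby("ticker")["close"]
    sma_fast = close_by_ticker.transform(lambda x: x.rolling(fast, min_periods=fast).mean())
    sma_slow = close_by_ticker.transform(lambda x: x.rolling(slow, min_periods=slow).mean())
    out = df.copy()
    out[f"sma_cross_{fast}_{slow}"] = sma_fast / sma_slow - 1.0
    return out


# Tiny synthetic demo with short windows so the result is easy to check by hand.
dates = pd.bdate_range("2024-01-01", periods=6)
idx = pd.MultiIndex.from_product([dates, ["UP", "FLAT"]], names=["date", "ticker"])
close = []
for i in range(6):
    close += [100.0 + i, 50.0]  # UP rises by 1 per day, FLAT stays constant
demo = pd.DataFrame({"close": close}, index=idx)
result = add_sma_crossover(demo, fast=2, slow=4)
```

A single ratio like this replaces two or three overlapping SMA columns with one scale-invariant number, which is the point of the redundancy advice.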

Category 2: Momentum Features (RSI)

RSI — Relative Strength Index

A momentum oscillator that measures the speed and magnitude of price changes. Range: 0–100. Values above 70 conventionally signal overbought conditions; below 30 signal oversold. Most commonly computed over 14 periods.

RSI = 100 − [100 / (1 + RS)]

RS = Average Gain over N periods / Average Loss over N periods
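
To make the formula concrete, here is a small worked example in plain Python. The helper `rsi_from_averages` is hypothetical, and returning 100 when the average loss is zero is an assumed convention for the sketch.

```python
def rsi_from_averages(avg_gain: float, avg_loss: float) -> float:
    """Apply RSI = 100 - 100 / (1 + RS), with RS = avg_gain / avg_loss."""
    if avg_loss == 0:
        return 100.0  # assumed convention: no losses in the window saturates RSI
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


rsi_equal = rsi_from_averages(1.0, 1.0)   # gains equal losses: RS = 1, RSI = 50
rsi_strong = rsi_from_averages(3.0, 1.0)  # gains three times losses: RS = 3, RSI = 75
```

Note how RS = 3 already puts RSI at 75, above the conventional overbought line of 70.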

import pandas as pd

def compute_rsi(series: pd.Series, window: int = 14) -> pd.Series:
    """RSI computation using Wilder smoothing."""
    delta = series.diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)

    # Wilder smoothing (exponential, alpha = 1/window)
    avg_gain = gain.ewm(alpha=1/window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1/window, min_periods=window, adjust=False).mean()

    rs = avg_gain / (avg_loss + 1e-10)  # avoid division by zero
    return 100 - (100 / (1 + rs))

def add_rsi_features(df, windows=(7, 14)):
    for w in windows:
        # Grouped so each ticker's RSI uses only its own price history
        df[f'rsi_{w}'] = df.groupby('ticker')['close'].transform(
            lambda x: compute_rsi(x, window=w)
        )
    return df

RSI as a feature — what the model might learn

RSI signals mean-reversion tendencies (extreme values tend to snap back) and momentum continuation (moderate RSI in a trend tends to persist). Whether your specific dataset and prediction horizon support either pattern is an empirical question — RSI is a hypothesis, not a guarantee.

Category 3: Volume and Liquidity Features

Volume provides context that price alone cannot. A +2% move on 10x average volume means something very different from a +2% move on 0.3x average volume.

import numpy as np

def add_volume_features(df, windows=(5, 20)):
    # Volume relative to its own recent average (scale-invariant)
    for w in windows:
        # transform keeps the original index, so no index repair is needed
        df[f'vol_ratio_{w}d'] = df.groupby('ticker')['volume'].transform(
            lambda x: x / x.rolling(w).mean()
        )

    # On-Balance Volume (OBV): cumulative volume signed by price direction
    def compute_obv(group):
        direction = np.sign(group['close'].diff())
        return (direction * group['volume']).cumsum()

    df['obv'] = df.groupby('ticker', group_keys=False).apply(compute_obv)

    # OBV trend (slope over 10 days) as feature
    # Note: the slope is in raw volume units, so it is not scale-invariant across tickers
    df['obv_trend_10d'] = df.groupby('ticker')['obv'].transform(
        lambda x: x.rolling(10).apply(
            lambda v: np.polyfit(range(len(v)), v, 1)[0], raw=True
        )
    )
    return df

Category 4: Volatility Features

Volatility is arguably the most important feature for options-adjacent strategies and for regime detection. Multiple volatility measures capture different aspects.

Historical volatility (realized vol)

def add_volatility_features(df, windows=(5, 10, 20, 60)):
    # Log returns (needed for vol computation)
    df['log_ret'] = df.groupby('ticker')['close'].transform(
        lambda x: np.log(x / x.shift(1))
    )

    for w in windows:
        # Annualized historical volatility
        df[f'hvol_{w}d'] = df.groupby('ticker')['log_ret'].transform(
            lambda x: x.rolling(w).std() * np.sqrt(252)
        )

    # Volatility ratio: short-term vs. long-term (vol regime indicator)
    # Requires both 5 and 60 to be present in `windows`
    df['vol_ratio'] = df['hvol_5d'] / (df['hvol_60d'] + 1e-10)

    # Intraday range as proportion of close (daily ATR proxy)
    df['daily_range'] = (df['high'] - df['low']) / df['close']
    df['range_vs_20d_avg'] = df.groupby('ticker')['daily_range'].transform(
        lambda x: x / x.rolling(20).mean()
    )
    return df

IV/HV ratio (from options data)

IV/HV Ratio

Implied Volatility divided by Historical Volatility. A ratio above 1 means the market is pricing in more uncertainty than what's been realized — often happens before earnings, macro events, or during fear cycles. Below 1 means options are "cheap" relative to realized moves.

From our observations in this project, the iv_hv_ratio was one of the highest-weighted features in the TFT's gate — the model learned it was contextually very important.
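
Computing the ratio itself is a one-liner once both volatilities are in the same units. The sketch below assumes an `iv_30d` column from the options data and the `hvol_20d` column from the volatility features, both as annualized decimals. Pairing the 30-day IV with the 20-day HV is an assumption made for illustration, and `add_iv_hv_ratio` is a hypothetical helper name.

```python
import pandas as pd


def add_iv_hv_ratio(df, iv_col="iv_30d", hv_col="hvol_20d"):
    """Add iv_hv_ratio = implied vol / realized vol; NaN where realized vol is 0 or missing."""
    out = df.copy()
    hv = out[hv_col].where(out[hv_col] > 0)  # non-positive HV becomes NaN, which avoids inf
    out["iv_hv_ratio"] = out[iv_col] / hv
    return out


demo = pd.DataFrame({
    "iv_30d": [0.30, 0.20, 0.25],
    "hvol_20d": [0.20, 0.25, 0.0],
})
result = add_iv_hv_ratio(demo)  # ratios: 1.5 (rich), 0.8 (cheap), NaN (no realized vol)
```

Leaving the zero-HV case as NaN, rather than adding a small epsilon, keeps a meaningless ratio from showing up as an extreme feature value.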

Category 5: Asset-Type Embeddings

When a model processes multiple asset types — stocks, commodities, bonds — it benefits from knowing the type of asset it's looking at, not just the numeric values. This is done via an embedding layer.

Embedding Layer

A lookup table that maps a categorical ID (like "commodity" = 1) to a dense vector of real numbers. These vectors are learned during training — the model discovers the best representation. The resulting vectors encode the "character" of each category in a way the rest of the network can use.

A concrete example

Suppose you have three asset types: equity (0), commodity (1), bond (2). You choose embedding dimension = 4. The embedding table might learn:

                 dim0   dim1   dim2   dim3
equity (0):    [ 0.12, -0.08,  0.41,  0.03]
commodity (1): [ 0.88,  0.31, -0.15,  0.72]
bond (2):      [-0.45,  0.62,  0.28, -0.19]

These numbers are not set manually — they emerge from training. The model uses these 4 extra numbers as context for any given row, helping it apply different patterns to equities vs. commodities.
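
Mechanically, the lookup is nothing more than row indexing. Here is a minimal NumPy sketch using the illustrative vectors from the table above. In a real model the table is a trainable parameter, not a fixed array.

```python
import numpy as np

# Rows are the illustrative vectors from the example; real values are learned.
embedding_table = np.array([
    [0.12, -0.08, 0.41, 0.03],    # equity (0)
    [0.88, 0.31, -0.15, 0.72],    # commodity (1)
    [-0.45, 0.62, 0.28, -0.19],   # bond (2)
])

asset_type_ids = np.array([1, 0, 2, 1])    # one integer ID per data row
vectors = embedding_table[asset_type_ids]  # shape (4, 4): one vector per row
```

Rows with the same asset type receive the same vector, which is how the network shares what it learns about a category across all of its members.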

import torch
import torch.nn as nn

class AssetTypeEmbedding(nn.Module):
    def __init__(self, num_asset_types: int, embed_dim: int = 8):
        super().__init__()
        self.embedding = nn.Embedding(num_asset_types, embed_dim)

    def forward(self, asset_type_ids: torch.Tensor) -> torch.Tensor:
        # asset_type_ids: (batch_size, seq_len), integer IDs
        # Output: (batch_size, seq_len, embed_dim)
        return self.embedding(asset_type_ids)

# In your data preparation:
ASSET_TYPE_MAP = {'equity': 0, 'commodity': 1, 'bond': 2, 'crypto': 3}
df['asset_type_id'] = df['asset_class'].map(ASSET_TYPE_MAP)

Ticker-level embeddings

You can take this further by embedding individual tickers (not just asset types). This allows the model to learn "NVDA behaves differently from COST" at a fundamental level, rather than just inferring it from the features.

class TickerEmbedding(nn.Module):
    def __init__(self, num_tickers: int, embed_dim: int = 16):
        super().__init__()
        self.embedding = nn.Embedding(num_tickers, embed_dim)

    def forward(self, ticker_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding(ticker_ids)

# TICKER_MAP: {'AAPL': 0, 'MSFT': 1, ...}
# IMPORTANT: this map must be saved and loaded consistently
# between training runs; ticker IDs must be stable
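
One way to keep the IDs stable is to persist the map as JSON and only ever append to it. The sketch below uses hypothetical helper names and an assumed file name, and it shows one possible approach rather than the project's actual persistence code.

```python
import json
import tempfile
from pathlib import Path


def build_ticker_map(tickers):
    """Assign IDs in sorted order so the same ticker set always yields the same map."""
    return {ticker: i for i, ticker in enumerate(sorted(set(tickers)))}


def save_ticker_map(ticker_map, path):
    Path(path).write_text(json.dumps(ticker_map, indent=2, sort_keys=True))


def load_ticker_map(path):
    return json.loads(Path(path).read_text())


def extend_ticker_map(ticker_map, new_tickers):
    """Give unseen tickers fresh IDs; IDs that already exist never change."""
    extended = dict(ticker_map)
    for ticker in sorted(set(new_tickers)):
        if ticker not in extended:
            extended[ticker] = max(extended.values(), default=-1) + 1
    return extended


# Round trip through a temporary file, then add one new ticker.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ticker_map.json"  # assumed file name
    original = build_ticker_map(["MSFT", "AAPL", "NVDA"])
    save_ticker_map(original, path)
    reloaded = load_ticker_map(path)
extended = extend_ticker_map(reloaded, ["COST", "AAPL"])
```

Rebuilding the map from scratch after the universe changes would silently reassign IDs, so a trained embedding table would then point at the wrong tickers. Appending avoids that, although the embedding layer still has to be resized for the new rows.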

Category 6: Seasonality Features

Financial markets have well-documented seasonal patterns: January effect, tax-loss harvesting in December, earnings seasons in January/April/July/October, low-volume summer lulls, end-of-quarter rebalancing. Encoding these explicitly gives the model a shortcut to patterns it would otherwise need many years of data to learn.

def add_seasonality_features(df):
    dates = pd.DatetimeIndex(df.index.get_level_values('date'))

    # Month of year (1-12): learnable via embedding or sin/cos encoding
    df['month'] = dates.month

    # Day of week (0-4 Monday-Friday)
    df['dow'] = dates.dayofweek

    # Quarter (1-4): earnings season alignment
    df['quarter'] = dates.quarter

    # Earnings season flag (approx. months 1,4,7,10)
    df['earnings_season'] = df['month'].isin([1, 4, 7, 10]).astype(int)

    # End-of-month flag for fund flows. Rough proxy: calendar day 26 or later,
    # not an exact count of the last 3 trading days of the month
    df['month_end'] = (dates.day >= 26).astype(int)

    # Cyclical encoding for month (preserves Dec→Jan continuity)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

    return df

Cyclical encoding

Encoding month as a raw integer (1–12) means December (12) looks "far" from January (1) in the model's feature space. But they are adjacent in time. Sine/cosine encoding wraps the cycle so December and January are numerically close.
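
A quick numeric check of that claim, using only the standard library. The helper names are illustrative.

```python
import math


def encode_month(month: int):
    """Map month 1-12 to a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


dec_jan = distance(encode_month(12), encode_month(1))  # adjacent months
jan_feb = distance(encode_month(1), encode_month(2))   # adjacent months
jan_jul = distance(encode_month(1), encode_month(7))   # six months apart
```

Every pair of adjacent months ends up the same distance apart, including December to January, while months half a year apart sit on opposite sides of the circle. With the raw integer encoding, December and January would be 11 units apart.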

Category 7: Cross-Ticker / Sector Features

Individual stocks don't move in isolation. NVDA often drags AMD. When tech is selling off broadly, individual names face headwinds regardless of their own fundamentals. Encoding sector context helps the model understand the broader environment.

def add_sector_context_features(df, sector_map):
    df['sector'] = df.index.get_level_values('ticker').map(sector_map)

    # Sector return: average return of all peers in the same sector, same day
    # This is a PEER feature: completely valid, no future info
    sector_daily_ret = (
        df.groupby(['date', 'sector'])['log_ret']
        .mean()
        .rename('sector_avg_ret')
    )
    df = df.join(sector_daily_ret, on=['date', 'sector'])

    # Ticker return relative to its sector average
    df['alpha_vs_sector'] = df['log_ret'] - df['sector_avg_ret']

    # Sector volatility, measured as the cross-sectional std of
    # same-day returns within the sector (dispersion across peers)
    sector_vol = (
        df.groupby(['date', 'sector'])['log_ret']
        .std()
        .rename('sector_vol')
    )
    df = df.join(sector_vol, on=['date', 'sector'])

    return df

Careful with same-day sector returns

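Using the sector's same-day average return as a feature is valid only if it is computed from data that is already available when you make the prediction. For a daily end-of-day model, which predicts the next day from data through today's close, this is fine. It would leak information in any setup where the prediction is made before that day's close.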
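
If your setup does predict before the close, one hedged option is to lag the sector feature by one trading day per ticker. The sketch below assumes the `(date, ticker)` MultiIndex used throughout this post and rows sorted by date. The helper name `lag_feature` is hypothetical.

```python
import pandas as pd


def lag_feature(df, column, periods=1):
    """Shift `column` back by `periods` rows within each ticker (rows assumed date-sorted)."""
    out = df.copy()
    out[f"{column}_lag{periods}"] = out.groupby("ticker")[column].shift(periods)
    return out


dates = pd.bdate_range("2024-01-01", periods=3)
idx = pd.MultiIndex.from_product([dates, ["AAA", "BBB"]], names=["date", "ticker"])
demo = pd.DataFrame(
    {"sector_avg_ret": [0.01, 0.01, -0.02, -0.02, 0.03, 0.03]}, index=idx
)
result = lag_feature(demo, "sector_avg_ret")  # each row now sees yesterday's sector return
```

The first row of every ticker becomes NaN after the shift, so drop or mask those rows before training.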

Regularization: L1, L2, and Elastic Net

With a rich feature set, models can overfit — memorizing training-period patterns that don't generalize. Regularization adds a penalty term to the loss function that discourages extreme weight values.

L1 Regularization (Lasso)

Adds the sum of absolute weight values to the loss. Encourages sparse solutions — some weights go to exactly zero, effectively removing features. Good for feature selection.

Loss_L1 = MSE + λ · Σ|wᵢ|

L2 Regularization (Ridge)

Adds the sum of squared weight values to the loss. Keeps all weights small but non-zero. Good for preventing any single feature from dominating. Standard for neural networks (called "weight decay").

Loss_L2 = MSE + λ · Σwᵢ²

Elastic Net

Combines L1 and L2: the L1 term provides sparsity and the L2 term provides stability when features are correlated. Often a reasonable default for high-dimensional financial feature sets.

Loss_EN = MSE + λ₁ · Σ|wᵢ| + λ₂ · Σwᵢ²
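
A small worked example of the three penalties, in plain Python so the arithmetic is easy to follow. Using two separate coefficients (`l1_lambda`, `l2_lambda`) is one common parametrization and is assumed here; some libraries use a single strength plus a mixing ratio instead.

```python
def l1_penalty(weights):
    """Sum of absolute weight values."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weight values."""
    return sum(w * w for w in weights)


def elastic_net_loss(mse, weights, l1_lambda, l2_lambda):
    """MSE plus a weighted sum of the L1 and L2 penalties."""
    return mse + l1_lambda * l1_penalty(weights) + l2_lambda * l2_penalty(weights)


weights = [0.5, -2.0, 0.0]
# L1 penalty = 0.5 + 2.0 + 0.0 = 2.5; L2 penalty = 0.25 + 4.0 + 0.0 = 4.25
loss = elastic_net_loss(mse=1.0, weights=weights, l1_lambda=0.1, l2_lambda=0.01)
# loss = 1.0 + 0.1 * 2.5 + 0.01 * 4.25 = 1.2925
```

The large weight (-2.0) dominates the L2 term because it is squared, whereas the L1 term charges every unit of weight at the same rate. That difference is why L1 pushes small weights all the way to zero while L2 mainly reins in large ones.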

import torch.nn.functional as F

# In PyTorch: L2 is built into the optimizer as weight_decay
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4  # This IS L2 regularization
)

# L1 is added manually to the loss:
def compute_loss_with_l1(predictions, targets, model, l1_lambda=1e-5):
    mse_loss = F.mse_loss(predictions, targets)
    # Sum of absolute values over all parameters (weights and biases)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return mse_loss + l1_lambda * l1_penalty

These are hyperparameters — settings you choose before training. They're not per-feature, they're per-model. Typical starting values: weight_decay ≈ 1e-4, l1_lambda ≈ 1e-5. You adjust them based on whether the model overfits.

The Full Feature Set Summary

After all the above, our feature matrix for each (date, ticker) row looks like:

Category	Features	Count
Trend	price_vs_sma_7, price_vs_sma_14, price_vs_sma_50, price_vs_sma_200	4
Momentum	rsi_7, rsi_14	2
Volume	vol_ratio_5d, vol_ratio_20d, obv_trend_10d	3
Volatility	hvol_5d, hvol_20d, hvol_60d, vol_ratio, range_vs_20d_avg	5
Options	iv_30d, iv_hv_ratio, put_call_ratio, skew_25d, term_slope	5
Sentiment	sentiment_score, sentiment_7d_ma, has_sentiment	3
Sector context	sector_avg_ret, alpha_vs_sector, sector_vol	3
Seasonality	month_sin, month_cos, quarter, earnings_season, month_end	5
Embeddings (IDs)	ticker_id, asset_type_id	2 (→ vectors)
Total		~30 features + embeddings

In Part 4, we'll see what happens when we feed this feature matrix into a Standard Transformer — and why it fails in a specific, diagnosable way.
