Validation Discipline: Catching Leakage, Rank IC, and the Economic Viability Wall

Why Validation Is the Hardest Part

Building a model that fits training data is easy. The hard part is knowing whether what it learned is real. Financial ML is uniquely treacherous because:

Look-ahead leakage can produce arbitrarily good-looking results in backtests
Statistical significance and economic viability are different things
Markets change — past patterns don't guarantee future patterns
Transaction costs can completely erase theoretical alpha

In this project, an adversarial audit caught a leakage bug in an MLP model that was showing suspiciously high IC. That catch prevented expensive mistakes. This part documents exactly how that process works.

Key Metrics: What Each One Measures

Rank IC (Information Coefficient)

Rank IC — Rank Information Coefficient

The Spearman rank correlation between a model's predicted return rankings and the actual return rankings, computed cross-sectionally on a given date. If the model predicts stock A will outperform stock B, and it does, that's a correct ranking. Rank IC aggregates this across all stocks on a given day. Range: -1 to +1. Typical "good" Rank IC in production quant systems: +0.05 to +0.15.

Rank IC = Spearman_ρ(rank(predictions), rank(actual_returns))

Spearman_ρ = 1 − (6 · Σdᵢ²) / (n(n²−1))

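where dᵢ = rank(prediction_i) − rank(actual_i) and n is the number of stocks in the cross-section on that date. The closed form assumes no tied ranks; scipy's spearmanr handles ties by correlating average ranks.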

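from scipy.stats import spearmanr
import pandas as pd
import numpy as np

def compute_daily_rank_ic(predictions_df: pd.DataFrame, min_stocks: int = 10) -> pd.Series:
    """
    predictions_df: indexed by (date, ticker)
    columns: 'predicted_ret20', 'actual_ret20'
    min_stocks: assumed minimum cross-section size per date (illustrative default)

    Returns: Series of daily Rank ICs
    """
    daily_ics = {}
    for date, group in predictions_df.groupby(level='date'):
        group = group.dropna(subset=['predicted_ret20', 'actual_ret20'])
        # Skip dates with too few stocks for a meaningful rank correlation
        if len(group) < min_stocks:
            continue
        ic, _ = spearmanr(group['predicted_ret20'], group['actual_ret20'])
        daily_ics[date] = ic

    series = pd.Series(daily_ics, dtype=float, name='rank_ic')
    print(f"IC > 0 days:    {(series > 0).mean():.1%}")
    print(f"IC > 0.05 days: {(series > 0.05).mean():.1%}")
    return series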
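As a sanity check on the Spearman formula, the sketch below computes Rank IC by hand for a single date and compares it with scipy. The five-stock cross-section is hypothetical and chosen so that no ranks are tied, which is the case the closed form covers.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Hypothetical 5-stock cross-section for one date (illustrative numbers only)
predicted = np.array([0.03, -0.01, 0.02, 0.00, -0.02])
actual = np.array([0.05, -0.03, 0.01, 0.02, -0.01])

# d_i = rank(prediction_i) - rank(actual_i)
d = rankdata(predicted) - rankdata(actual)
n = len(predicted)

# Closed-form Spearman rho (valid because there are no tied ranks here)
rho_closed_form = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
rho_scipy, _ = spearmanr(predicted, actual)

print(f"closed form: {rho_closed_form:.2f}, scipy: {rho_scipy:.2f}")
# closed form: 0.80, scipy: 0.80
```

Here the rank differences are (0, 1, 1, −1, −1), so Σdᵢ² = 4 and ρ = 1 − 24/120 = 0.8. A five-stock example is far too small for real use; it only shows that the two routes agree.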

Directional Accuracy (DirAcc)

Directional Accuracy

The proportion of predictions where the model correctly identified the direction of return (up vs. down). A random model achieves approximately 50%. Directional accuracy of 52–55% can be economically significant at scale. Above 60% is rare and usually indicates leakage.
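A minimal sketch of the calculation, assuming predictions and realized returns arrive as aligned arrays. The helper names and the choice to drop exact zeros are assumptions for illustration. The second helper shows why a small edge matters at scale: it measures how many standard errors an accuracy sits above a coin flip, under the strong assumption that the n predictions are independent.

```python
import numpy as np

def directional_accuracy(predicted, actual):
    """Share of observations where predicted and realized returns share a sign."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Assumption: exact zeros are dropped rather than counted as up or down
    mask = (predicted != 0) & (actual != 0)
    return float(np.mean(np.sign(predicted[mask]) == np.sign(actual[mask])))

def z_score_vs_coin_flip(accuracy, n):
    """Standard errors above 50%, assuming n independent predictions."""
    return float((accuracy - 0.5) / np.sqrt(0.25 / n))

# Hypothetical predictions and realized returns
pred = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03]
real = [0.01, 0.02, 0.04, -0.01, -0.02, -0.01]

acc = directional_accuracy(pred, real)
print(f"DirAcc: {acc:.1%}")  # 4 of 6 signs match
print(f"z at 53% over 10,000 predictions: {z_score_vs_coin_flip(0.53, 10_000):.1f}")
```

Overlapping 20-day returns and same-day cross-sectional correlation both violate the independence assumption, so the effective sample size is smaller than the raw count and the z-score overstates significance.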

Suspiciously high accuracy is a red flag

If you see directional accuracy above 58% or Rank IC above 0.15 on clean validation data, investigate for leakage before celebrating. True financial predictors are weak signals. Anything that looks too good to be true in finance almost always is.

Variance Ratio (revisited)

As established in Part 4, the variance ratio is Var(predictions) / Var(actual returns). You want it above 0.02 at minimum. Below 0.01 the predictions have collapsed: the model is no longer differentiating between stocks.
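The thresholds can be wired into a quick check. This is a sketch that pools all observations into one ratio; whether Part 4 computed it pooled or per date is not restated here, so treat that as an assumption. The "borderline" label for the 0.01 to 0.02 band is also an assumption.

```python
import numpy as np

def variance_ratio(predicted, actual):
    """Var(predictions) / Var(actual returns), pooled over all observations."""
    return float(np.var(predicted) / np.var(actual))

def variance_ratio_status(ratio):
    # Thresholds from the text: below 0.01 is collapse, 0.02 is the minimum target
    if ratio < 0.01:
        return "collapsed"
    if ratio < 0.02:
        return "borderline"  # assumed label; the text leaves this band unnamed
    return "ok"

# Hypothetical realized returns and two scaled-down prediction sets
actual = np.array([0.04, -0.02, 0.01, -0.03, 0.05])
healthy = 0.2 * actual     # variance ratio = 0.2 ** 2 = 0.04
collapsed = 0.05 * actual  # variance ratio = 0.05 ** 2 = 0.0025

for name, preds in [("healthy", healthy), ("collapsed", collapsed)]:
    ratio = variance_ratio(preds, actual)
    print(f"{name}: ratio={ratio:.4f} -> {variance_ratio_status(ratio)}")
```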

Quintile Spread

Divide predictions into 5 equal buckets (quintiles) from lowest to highest predicted return. Compute the actual return of the top quintile (Q5) minus the actual return of the bottom quintile (Q1). A positive quintile spread means the model correctly identifies outperformers vs. underperformers. This is the most practical metric for a long-short strategy.

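def compute_quintile_spread(predictions_df: pd.DataFrame) -> pd.Series:
    """Compute Q5-Q1 quintile spread per date."""
    spreads = {}
    for date, group in predictions_df.groupby(level='date'):
        group = group.dropna()
        if len(group) < 5:  # assumed minimum: at least one stock per quintile
            continue
        # Rank first so tied predictions cannot produce duplicate bin edges
        bucket = pd.qcut(group['predicted_ret20'].rank(method='first'), 5, labels=False)
        q5 = group.loc[bucket == 4, 'actual_ret20'].mean()
        q1 = group.loc[bucket == 0, 'actual_ret20'].mean()
        spreads[date] = q5 - q1
    return pd.Series(spreads, dtype=float, name='q5_minus_q1')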

Walk-Forward Validation

A single train/validation split is insufficient for financial models because it depends entirely on which time period you happened to select as validation. Walk-forward validation (also called time-series cross-validation) tests the model across multiple non-overlapping time windows.

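def walk_forward_validation(df, model_fn, n_folds=4, min_train_days=500):
    """
    Splits the timeline into n_folds sequential windows.
    Each fold trains on all prior data and tests on the next window.

    Assumption: model_fn(train_df, test_df) returns a DataFrame indexed like
    test_df with 'predicted_ret20' and 'actual_ret20' columns.
    """
    all_dates = df.index.get_level_values('date').unique().sort_values()
    fold_size = len(all_dates) // (n_folds + 1)
    PURGE_DAYS = 21  # gap between train and test so forward-return labels do not overlap

    row_dates = df.index.get_level_values('date')
    results = []
    for fold in range(n_folds):
        # Expanding window training set
        train_end_idx = fold_size * (fold + 1)
        test_start_idx = train_end_idx + PURGE_DAYS
        test_end_idx = test_start_idx + fold_size

        train_dates = all_dates[:train_end_idx]
        test_dates = all_dates[test_start_idx:test_end_idx]

        # Skip folds with too little history or an empty test window
        if len(train_dates) < min_train_days or len(test_dates) == 0:
            continue

        train_df = df[row_dates.isin(train_dates)]
        test_df = df[row_dates.isin(test_dates)]

        preds = model_fn(train_df, test_df)
        results.append({
            'fold': fold,
            'test_start': test_dates[0],
            'test_end': test_dates[-1],
            'mean_rank_ic': compute_daily_rank_ic(preds).mean(),
        })

    return pd.DataFrame(results)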

From our LightGBM validation (Part 7): 3 of 4 folds showed positive Rank IC. This is meaningful — the signal is not entirely regime-fragile. But 1 negative fold tells us there are market conditions where the model underperforms. That's normal and realistic.
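To make the fold boundaries concrete, the sketch below reproduces only the index arithmetic of the walk-forward split (expanding train window, 21-day purge, fixed-size test window) for a hypothetical timeline of 1,000 trading days. The helper name is an assumption for illustration.

```python
def walk_forward_boundaries(n_dates, n_folds=4, purge_days=21):
    """Return (train_end, test_start, test_end) index triples, one per fold."""
    fold_size = n_dates // (n_folds + 1)
    boundaries = []
    for fold in range(n_folds):
        train_end = fold_size * (fold + 1)               # train on dates [0, train_end)
        test_start = train_end + purge_days              # skip the purge gap
        test_end = min(test_start + fold_size, n_dates)  # test on [test_start, test_end)
        boundaries.append((train_end, test_start, test_end))
    return boundaries

for fold, (train_end, test_start, test_end) in enumerate(walk_forward_boundaries(1000)):
    print(f"fold {fold}: train [0, {train_end})  "
          f"purge [{train_end}, {test_start})  test [{test_start}, {test_end})")
```

Two things stand out. The last test window is cut short by the purge gap (179 days instead of 200). And with only 1,000 dates, the first two folds have 200 and 400 training days, so a min_train_days of 500 would leave just two usable folds if that parameter is a lower bound on training history, as its name suggests.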

The Leakage Audit: How We Caught the MLP Bug

An MLP (multi-layer perceptron) model trained on FEATURE_SET_B showed extraordinarily high IC — well above the plausible range for clean financial signals. Rather than celebrating, we ran an adversarial audit.

The four audit checks

Normalization timing check: Was the StandardScaler fit on training data only, or on the full dataset (train + validation)?

Finding: Scaler was fit on the full dataset. Mean and std were computed using future validation data. This leaked information about the validation period distribution into the normalized training features.

Feature selection provenance check: Was feature selection (e.g., mutual information, importance-based pruning) performed on the full dataset or within each training fold?

Finding: Feature selection used test.parquet to rank feature importance before training. Features correlated with validation-period returns were retained. This is classic look-ahead in feature selection.

Target construction check: Were forward returns (the targets) accidentally included anywhere in the feature set?

Finding: Clean in FEATURE_SET_B, but was the original root cause in an earlier iteration.

Date-level IC check: Do predictions show unrealistically high IC on specific dates, suggesting the model "knew" about those dates?

Finding: IC spikes on earnings dates where sentiment lags were not properly enforced.
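The date-level IC check can be approximated with a robust outlier screen over the daily IC series. This is a sketch of one possible approach, not the audit code itself: the function name, the median/MAD z-score, and the threshold of 3 are all assumptions.

```python
import numpy as np
import pandas as pd

def flag_ic_spikes(daily_ic: pd.Series, z_threshold: float = 3.0) -> pd.Series:
    """Return dates whose IC lies more than z_threshold robust z-scores from the median."""
    median = daily_ic.median()
    mad = (daily_ic - median).abs().median()
    # 1.4826 scales the MAD to match a standard deviation under normality
    robust_z = (daily_ic - median) / (1.4826 * mad)
    return daily_ic[robust_z.abs() > z_threshold]

# Hypothetical daily IC series with one planted spike
dates = pd.bdate_range("2023-01-02", periods=60)
rng = np.random.default_rng(1)
daily_ic = pd.Series(rng.normal(0.03, 0.05, size=60), index=dates)
daily_ic.iloc[40] = 0.60  # stands in for a date where a lag was not enforced

spikes = flag_ic_spikes(daily_ic)
print(spikes)
```

Flagged dates are then cross-referenced by hand against event calendars such as earnings announcements. Median and MAD are used instead of mean and standard deviation so that the spikes themselves do not inflate the yardstick.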

The canonical leakage rule

The fundamental rule: every step that uses data to make a decision (normalization, feature selection, hyperparameter tuning, model training) must use only training-period data. The test set must be completely invisible until the final evaluation. Any step that even glimpses the test set contaminates the result.
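For normalization, the rule comes down to where fit is called. The sketch below uses scikit-learn's StandardScaler on synthetic data, with a deliberately shifted validation period so the difference between the leaky and clean statistics is visible. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrices: the validation period has a shifted mean
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
X_val = rng.normal(loc=2.0, scale=1.0, size=(100, 3))

# Leaky: mean and std computed on train + validation together
leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_val]))

# Clean: mean and std computed on the training period only
clean_scaler = StandardScaler().fit(X_train)
X_train_scaled = clean_scaler.transform(X_train)
X_val_scaled = clean_scaler.transform(X_val)  # reuses the training mean and std

print("clean mean:", clean_scaler.mean_.round(2))
print("leaky mean:", leaky_scaler.mean_.round(2))
```

The leaky scaler's mean is pulled toward the validation distribution, which is exactly the information the training features should never carry. In a walk-forward setup, the clean fit is repeated inside every fold.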

from sklearn.feature_selection import mutual_info_regression

def clean_feature_selection(train_df, val_df, feature_cols, n_features=15):
    """
    Feature selection INSIDE training fold only.
    Val data is NEVER accessed during this process.
    """
    X_train = train_df[feature_cols].values
    y_train = train_df['ret20'].values

    # Rank features by mutual information with target
    # Using ONLY training data
    mi_scores = mutual_info_regression(X_train, y_train)
    feature_ranking = pd.Series(mi_scores, index=feature_cols).sort_values(ascending=False)

    selected_features = feature_ranking.head(n_features).index.tolist()

    # NOW we can apply the selection to val_df
    # But only using the features chosen from train_df
    return selected_features

The Economic Viability Wall

This is perhaps the most important lesson in the entire series: statistical signal ≠ tradable profit.
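The arithmetic behind that gap is simple: each rebalance pays costs proportional to how much of the book is traded. The sketch below uses hypothetical inputs chosen only to illustrate the calculation; none of the numbers are results from this project, and the function names and the linear cost model are assumptions.

```python
def net_spread_after_costs(gross_spread, turnover, cost_bps):
    """
    gross_spread: long-short return per rebalance, before costs (decimal)
    turnover:     notional traded per rebalance as a multiple of capital
    cost_bps:     all-in cost per unit of notional traded, in basis points
    """
    cost_drag = turnover * cost_bps / 10_000
    return gross_spread - cost_drag

def breakeven_cost_bps(gross_spread, turnover):
    """Cost level at which the net spread is exactly zero."""
    return gross_spread / turnover * 10_000

# Hypothetical inputs, for illustration only
gross = 0.006    # 0.6% gross spread per rebalance
turnover = 1.5   # 150% of capital traded per rebalance
cost = 20        # 20 bps per unit traded

net = net_spread_after_costs(gross, turnover, cost)
breakeven = breakeven_cost_bps(gross, turnover)
print(f"net spread per rebalance: {net:.4f}")
print(f"break-even cost: {breakeven:.0f} bps")
```

In this made-up case half of the gross spread goes to costs, and the strategy stops paying once all-in costs reach 40 bps. A real simulation would also model spread, market impact, and borrow costs rather than a single flat rate.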

Our clean LightGBM model (FEATURE_SET_D_INVERTED) showed Rank IC ≈ +0.08. That's real signal. It survived walk-forward validation. But when we simulated actual trading performance including realistic transaction costs:

+0.0801 Rank IC (before costs)