Why Validation Is the Hardest Part
Building a model that fits training data is easy. The hard part is knowing whether what it learned is real. Financial ML is uniquely treacherous because:
- Look-ahead leakage can produce arbitrarily good-looking results in backtests
- Statistical significance and economic viability are different things
- Markets change — past patterns don't guarantee future patterns
- Transaction costs can completely erase theoretical alpha
In this project, an adversarial audit caught a leakage bug in an MLP model that was showing suspiciously high IC. That catch prevented expensive mistakes. This part documents exactly how that process works.
Key Metrics: What Each One Measures
Rank IC (Information Coefficient)
Rank IC = Spearman_ρ(rank(predictions), rank(actual_returns))
Spearman_ρ = 1 − (6 · Σdᵢ²) / (n(n²−1))
where dᵢ = rank(prediction_i) − rank(actual_i)
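from scipy.stats import spearmanr
import pandas as pd
import numpy as np

def compute_daily_rank_ic(predictions_df: pd.DataFrame, min_names: int = 10) -> pd.Series:
    """
    predictions_df: indexed by (date, ticker)
    columns: 'predicted_ret20', 'actual_ret20'
    Returns: Series of daily Rank ICs
    """
    daily_ics = []
    for date, group in predictions_df.groupby(level='date'):
        # min_names is an assumed placeholder; the original size threshold is not shown
        if len(group) < min_names:
            continue
        ic, _ = spearmanr(group['predicted_ret20'], group['actual_ret20'])
        daily_ics.append((date, ic))
    series = pd.Series(dict(daily_ics), dtype=float, name='rank_ic')
    print(f"IC > 0 days: {(series > 0).mean():.1%}")
    print(f"IC > 0.05 days: {(series > 0.05).mean():.1%}")
    return series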
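To make the formula concrete, here is a small worked check that the closed-form expression and the "Pearson correlation of ranks" definition agree when there are no ties. The five prediction/return pairs are made-up illustrative numbers, not data from this project.

```python
import pandas as pd

# Illustrative numbers only; not data from this project.
predicted = pd.Series([0.03, -0.01, 0.02, 0.00, -0.02])
actual = pd.Series([0.05, -0.02, 0.01, 0.02, -0.04])

pred_rank = predicted.rank()      # [5, 2, 4, 3, 1]
actual_rank = actual.rank()       # [5, 2, 3, 4, 1]

# Closed form: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); valid only without ties
d = pred_rank - actual_rank       # [0, 0, 1, -1, 0], so sum(d^2) = 2
n = len(d)
rho_closed_form = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Equivalent definition: Pearson correlation of the two rank vectors
rho_from_ranks = pred_rank.corr(actual_rank)

print(rho_closed_form, rho_from_ranks)  # both approximately 0.9
```

With tied values the closed form is no longer exact, which is one reason to rely on a library routine such as `spearmanr` for real data.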
Directional Accuracy (DirAcc)
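Assuming the usual definition (the share of predictions whose sign matches the sign of the realized return), a minimal sketch could look like the following. The function name and the choice to skip zero or missing values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def directional_accuracy(predicted: pd.Series, actual: pd.Series) -> float:
    """Fraction of observations where sign(prediction) equals sign(realized return)."""
    # Assumption: rows with a zero or missing value on either side are skipped,
    # because they have no direction to compare.
    mask = predicted.notna() & actual.notna() & (predicted != 0) & (actual != 0)
    hits = np.sign(predicted[mask]) == np.sign(actual[mask])
    return float(hits.mean())

# Illustrative values only
pred = pd.Series([0.01, -0.02, 0.03, -0.01])
real = pd.Series([0.02, 0.01, 0.01, -0.03])
dir_acc = directional_accuracy(pred, real)
print(f"DirAcc: {dir_acc:.2f}")  # 3 of 4 signs match
```

A value near 0.50 is what a coin flip would produce on a roughly balanced universe, so DirAcc is best read alongside Rank IC rather than on its own.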
Variance Ratio (revisited)
As established in Part 4, the variance ratio is Var(predictions) / Var(actual returns). You want this above 0.02 at minimum. A value below 0.01 indicates collapse: the model isn't differentiating between stocks.
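A minimal sketch of this check, using the two thresholds quoted above. The pooled (all rows at once) variance, the function names, and the "borderline" label for the band between the thresholds are assumptions for illustration; the data is synthetic.

```python
import numpy as np

COLLAPSE_BELOW = 0.01   # from the text: below this the model has collapsed
HEALTHY_ABOVE = 0.02    # from the text: minimum acceptable ratio

def variance_ratio(predicted, actual) -> float:
    """Var(predictions) / Var(actual returns), pooled over all rows."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.var(predicted) / np.var(actual))

def label_variance_ratio(vr: float) -> str:
    if vr < COLLAPSE_BELOW:
        return "collapse"
    if vr < HEALTHY_ABOVE:
        return "borderline"   # hypothetical label for the 0.01 to 0.02 band
    return "ok"

rng = np.random.default_rng(0)
actual = rng.normal(0.0, 0.05, size=1000)   # synthetic returns
spread_out = 0.2 * actual                   # ratio = 0.2**2 = 0.04
collapsed = 0.05 * actual                   # ratio = 0.05**2 = 0.0025

for name, preds in [("spread_out", spread_out), ("collapsed", collapsed)]:
    vr = variance_ratio(preds, actual)
    print(name, round(vr, 4), label_variance_ratio(vr))
```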
Quintile Spread
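def compute_quintile_spread(predictions_df: pd.DataFrame, min_names: int = 10) -> pd.Series:
    """Compute Q5-Q1 quintile spread per date."""
    spreads = {}
    for date, group in predictions_df.groupby(level='date'):
        group = group.dropna()
        if len(group) < min_names:  # assumed placeholder; the original threshold is not shown
            continue
        # Rank first so tied predictions cannot produce duplicate bin edges in qcut
        q = pd.qcut(group['predicted_ret20'].rank(method='first'), 5, labels=False)
        spreads[date] = group.loc[q == 4, 'actual_ret20'].mean() - group.loc[q == 0, 'actual_ret20'].mean()
    return pd.Series(spreads, dtype=float, name='q5_q1_spread')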
Walk-Forward Validation
A single train/validation split is insufficient for financial models because it depends entirely on which time period you happened to select as validation. Walk-forward validation (also called time-series cross-validation) tests the model across multiple non-overlapping time windows.
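def walk_forward_validation(df, model_fn, n_folds=4, min_train_days=500):
    """
    Splits the timeline into n_folds sequential windows.
    Each fold trains on all prior data and tests on the next window.
    """
    all_dates = df.index.get_level_values('date').unique().sort_values()
    fold_size = len(all_dates) // (n_folds + 1)
    PURGE_DAYS = 21  # gap between train and test; presumably chosen to exceed the 20-day target horizon
    results = []
    for fold in range(n_folds):
        # Expanding window training set
        train_end_idx = fold_size * (fold + 1)
        test_start_idx = train_end_idx + PURGE_DAYS
        test_end_idx = test_start_idx + fold_size
        train_dates = all_dates[:train_end_idx]
        test_dates = all_dates[test_start_idx:test_end_idx]
        if len(train_dates) < min_train_days:
            continue  # not enough history to train this fold
        # Assumed interface (not shown in the original):
        # model_fn(train_df, test_df) returns a dict of fold-level metrics.
        row_dates = df.index.get_level_values('date')
        train_df = df[row_dates.isin(train_dates)]
        test_df = df[row_dates.isin(test_dates)]
        results.append({'fold': fold, **model_fn(train_df, test_df)})
    return pd.DataFrame(results)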
From our LightGBM validation (Part 7): 3 of 4 folds showed positive Rank IC. This is meaningful — the signal is not entirely regime-fragile. But 1 negative fold tells us there are market conditions where the model underperforms. That's normal and realistic.
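The fold layout itself can be checked in isolation. The sketch below reproduces only the index arithmetic of the expanding-window split with a purge gap, working on integer date positions; the function name is hypothetical and the 2,500-day timeline is an arbitrary example.

```python
from typing import List, Tuple

def purged_expanding_folds(n_dates: int, n_folds: int = 4,
                           purge_days: int = 21) -> List[Tuple[range, range]]:
    """Return (train_positions, test_positions) for each fold."""
    fold_size = n_dates // (n_folds + 1)
    folds = []
    for fold in range(n_folds):
        train_end = fold_size * (fold + 1)
        test_start = train_end + purge_days          # skip the purge gap
        test_end = min(test_start + fold_size, n_dates)
        if test_start >= test_end:
            continue                                 # nothing left to test on
        folds.append((range(0, train_end), range(test_start, test_end)))
    return folds

folds = purged_expanding_folds(n_dates=2500)
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train [0, {train.stop}) test [{test.start}, {test.stop})")
```

Two properties are worth asserting in any implementation: every test window starts at least `purge_days` after its training window ends, and no two test windows overlap. Note that the last test window is shorter than the others because the purge gap pushes it against the end of the timeline.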
The Leakage Audit: How We Caught the MLP Bug
An MLP (multi-layer perceptron) model trained on FEATURE_SET_B showed extraordinarily high IC — well above the plausible range for clean financial signals. Rather than celebrating, we ran an adversarial audit.
The four audit checks
1. Normalization timing check: Was the StandardScaler fit on training data only, or on the full dataset (train + validation)?
   Finding: Scaler was fit on the full dataset. Mean and std were computed using future validation data. This leaked information about the validation period distribution into the normalized training features.
2. Feature selection provenance check: Was feature selection (e.g., mutual information, importance-based pruning) performed on the full dataset or within each training fold?
   Finding: Feature selection used test.parquet to rank feature importance before training. Features correlated with validation-period returns were retained. This is classic look-ahead in feature selection.
3. Target construction check: Were forward returns (the targets) accidentally included anywhere in the feature set?
   Finding: Clean in FEATURE_SET_B, but was the original root cause in an earlier iteration.
4. Date-level IC check: Do predictions show unrealistically high IC on specific dates, suggesting the model "knew" about those dates?
   Finding: IC spikes on earnings dates where sentiment lags were not properly enforced.
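The fix for the first finding is to compute scaling statistics from training rows only and reuse them unchanged on validation rows. The sketch below does this with plain pandas rather than StandardScaler to stay dependency-light; it uses the population standard deviation (ddof=0), which is what StandardScaler computes. The helper names and the tiny data frames are illustrative assumptions.

```python
import pandas as pd

def fit_scaler(train: pd.DataFrame, cols):
    """Compute mean and std from TRAINING rows only."""
    mean = train[cols].mean()
    std = train[cols].std(ddof=0).replace(0.0, 1.0)  # guard against constant columns
    return mean, std

def apply_scaler(df: pd.DataFrame, cols, mean, std) -> pd.DataFrame:
    """Apply previously fitted statistics; never refit on this data."""
    out = df.copy()
    out[cols] = (df[cols] - mean) / std
    return out

# Illustrative data: the validation period has a shifted distribution
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0]})
val = pd.DataFrame({"f1": [10.0, 12.0]})
cols = ["f1"]

mean, std = fit_scaler(train, cols)
train_scaled = apply_scaler(train, cols, mean, std)
val_scaled = apply_scaler(val, cols, mean, std)

print(mean["f1"])                  # 2.5, the training mean, unaffected by val
print(val_scaled["f1"].tolist())   # large values: the shift stays visible
```

Had the statistics been fit on train and validation together, the mean would have moved toward the validation values, and the training features would have carried information about a period the model is not supposed to have seen.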
The canonical leakage rule
The fundamental rule: every step that uses data to make a decision (normalization, feature selection, hyperparameter tuning, model training) must use only training-period data. The test set must be completely invisible until the final evaluation. Any step that even glimpses the test set contaminates the result.
def clean_feature_selection(train_df, val_df, feature_cols, n_features=15):
    """
    Feature selection INSIDE training fold only.
    Val data is NEVER accessed during this process.
    """
    from sklearn.feature_selection import mutual_info_regression
    X_train = train_df[feature_cols].values
    y_train = train_df['ret20'].values
    # Rank features by mutual information with target
    # Using ONLY training data
    mi_scores = mutual_info_regression(X_train, y_train)
    feature_ranking = pd.Series(mi_scores, index=feature_cols).sort_values(ascending=False)
    selected_features = feature_ranking.head(n_features).index.tolist()
    # NOW we can apply the selection to val_df
    # But only using the features chosen from train_df
    return selected_features
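The same train-only pattern can be demonstrated end to end on synthetic data. To keep the example free of scikit-learn, this sketch swaps mutual information for absolute rank correlation as the scoring rule; that substitution, the function name, and the synthetic columns are all assumptions for illustration, and row order stands in for time order.

```python
import numpy as np
import pandas as pd

def select_features_train_only(train_df, feature_cols, target_col, n_features):
    """Score features on training rows only, using |rank correlation| with the target."""
    feature_ranks = train_df[feature_cols].rank()
    target_rank = train_df[target_col].rank()
    scores = feature_ranks.corrwith(target_rank).abs().sort_values(ascending=False)
    return scores.head(n_features).index.tolist()

rng = np.random.default_rng(42)
n = 600
target = rng.normal(size=n)
df = pd.DataFrame({
    "f_signal": target + rng.normal(scale=0.5, size=n),  # genuinely related to target
    "f_noise_a": rng.normal(size=n),                     # unrelated
    "f_noise_b": rng.normal(size=n),                     # unrelated
    "ret20": target,
})
train_df, val_df = df.iloc[:400], df.iloc[400:]          # earlier rows train, later rows validate
feature_cols = ["f_signal", "f_noise_a", "f_noise_b"]

selected = select_features_train_only(train_df, feature_cols, "ret20", n_features=1)
X_val = val_df[selected]   # validation data is only touched AFTER selection is final
print(selected)
```

The selection function never receives `val_df`, so there is no code path by which validation-period returns can influence which features survive.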
The Economic Viability Wall
This is perhaps the most important lesson in the entire series: statistical signal ≠ tradable profit.
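The arithmetic behind that gap is simple: trading costs scale with turnover, so a strategy that trades often needs a much larger gross return to break even. The sketch below uses a deliberately simplified cost model (cost = turnover × per-trade cost) with hypothetical inputs; none of the numbers are results from this project.

```python
def net_annual_return(gross_annual_return: float,
                      annual_turnover: float,
                      cost_per_trade_bps: float) -> float:
    """
    gross_annual_return: return before costs, as a fraction (0.06 = 6%)
    annual_turnover: traded notional per year as a multiple of capital
    cost_per_trade_bps: all-in cost per unit traded, in basis points
    """
    annual_cost = annual_turnover * cost_per_trade_bps / 10_000
    return gross_annual_return - annual_cost

# Hypothetical scenarios with the same gross return
low_turnover = net_annual_return(0.06, annual_turnover=12, cost_per_trade_bps=10)
high_turnover = net_annual_return(0.06, annual_turnover=50, cost_per_trade_bps=15)

print(f"low turnover:  {low_turnover:+.3f}")   # 0.06 - 0.012 = +0.048
print(f"high turnover: {high_turnover:+.3f}")  # 0.06 - 0.075 = -0.015
```

The same gross signal is profitable in one scenario and loss-making in the other, which is why a positive Rank IC alone says nothing about tradability.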
Our clean LightGBM model (FEATURE_SET_D_INVERTED) showed Rank IC ≈ +0.08. That's real signal. It survived walk-forward validation. But when we simulated actual trading performance including realistic transaction costs:
+0.0801 Rank IC (before costs)