## Why LightGBM?
After the effort invested in the TFT, it was initially surprising that LightGBM, a gradient-boosted decision tree model, produced comparable or better clean Rank IC on this dataset. It shouldn't have been.
### The tabular data advantage
For cross-sectional financial prediction, the input is fundamentally a table: each row is a (date, ticker) observation with ~30 engineered features. This is structured tabular data. On this kind of data, tree-based models have several structural advantages over deep learning:
| Property | LightGBM | Neural Networks (TFT, ST) |
|---|---|---|
| Feature interactions | Automatic (splits) | Requires large capacity + data |
| Handling non-linearity | Excellent (tree splits) | Good (but needs regularization) |
| Robustness to outliers | Very good (rank-based splits) | Sensitive (unless carefully handled) |
| Feature scaling needed? | No | Yes (normalized inputs) |
| Overfitting tendency | Low (with proper params) | Higher on small datasets |
| Training time | Minutes | Hours (TFT) |
| Interpretability | Feature importance | Partial (gate weights in TFT) |
The key insight from the literature and this project: transformers excel at sequence modeling when the temporal structure of the sequence itself is what carries information (language, audio, video). For cross-sectional equity ranking, the temporal structure is partially encoded in the engineered features (moving averages, momentum, volatility). Once those features exist, the prediction problem becomes "given these 30 numbers about this stock today, predict its 20-day return." That's a tabular problem, and trees are very good at it.
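To make the tabular framing concrete, here is a minimal sketch of how a price history could be turned into (date, ticker) rows with one engineered feature and a forward-return target. The `close` column, the helper name `add_target_and_feature`, and the toy prices are assumptions for illustration, not the project's actual pipeline.

```python
import numpy as np
import pandas as pd

def add_target_and_feature(df, horizon=20, sma_window=50):
    """Add a forward-return target and one trend feature to a (date, ticker) panel.

    Assumes `df` has a MultiIndex with levels ('date', 'ticker'), is sorted by
    date, and has a 'close' column (hypothetical layout).
    """
    out = df.copy()
    by_ticker = out.groupby(level="ticker")["close"]
    # Target: return from today to `horizon` trading days ahead, per ticker
    out[f"ret{horizon}"] = by_ticker.shift(-horizon) / out["close"] - 1.0
    # Feature: price relative to its own trailing moving average, per ticker
    sma = by_ticker.transform(lambda s: s.rolling(sma_window).mean())
    out["price_vs_sma"] = out["close"] / sma - 1.0
    # Rows without a full lookback or a realized target are dropped
    return out.dropna()

# Toy data: ticker A grows 1% per day, ticker B is flat
dates = pd.date_range("2024-01-01", periods=30, freq="B")
idx = pd.MultiIndex.from_product([dates, ["A", "B"]], names=["date", "ticker"])
day_number = np.repeat(np.arange(30), 2)
is_a = idx.get_level_values("ticker") == "A"
close = np.where(is_a, 100.0 * 1.01 ** day_number, 50.0)

panel = add_target_and_feature(pd.DataFrame({"close": close}, index=idx),
                               horizon=20, sma_window=5)
```

Each surviving row of `panel` is one observation: a handful of numbers describing a stock on a date, plus the return realized 20 days later. All of the temporal information the model sees is whatever the feature columns encode.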
## Feature Sets: The Experimental Framework
Rather than training one model with all features, we tested multiple feature sets systematically. Each "feature set" is a hypothesis about which signals contain predictive information.
### What is FEATURE_SET_D_INVERTED?
"Inverted" refers to a key construction choice: instead of selecting features that are most correlated with returns (forward-looking), we select features based on their temporal stability within the training window — features that are consistently informative across the training period, not just correlated with the specific test period.
This inverts the typical selection criterion and reduces the risk of selecting features that happen to correlate with the validation period by chance.
```python
from scipy.stats import spearmanr

def build_feature_set_d_inverted(train_df, val_df):
    """
    Feature set D with inverted selection criterion.

    Selection criterion: temporal consistency within the training window,
    not raw correlation with test targets. `val_df` is never read here.
    """
    base_features = [
        # Trend
        'price_vs_sma_7', 'price_vs_sma_14', 'price_vs_sma_50', 'price_vs_sma_200',
        # Momentum
        'rsi_7', 'rsi_14',
        # Volume
        'vol_ratio_5d', 'vol_ratio_20d', 'obv_trend_10d',
        # Volatility
        'hvol_5d', 'hvol_20d', 'hvol_60d', 'vol_ratio', 'daily_range_vs_avg',
        # Options
        'iv_30d', 'iv_hv_ratio', 'put_call_ratio', 'skew_25d',
        # Sentiment
        'sentiment_score', 'sentiment_7d_ma',
        # Sector
        'sector_avg_ret', 'alpha_vs_sector', 'sector_vol',
        # Seasonality
        'month_sin', 'month_cos', 'earnings_season',
    ]

    # Split the training dates into 3 temporal chunks.
    # A feature is "stable" if it has positive IC in at least 2 of the 3 chunks.
    row_dates = train_df.index.get_level_values('date')
    dates = row_dates.unique().sort_values()
    chunk_size = len(dates) // 3

    stable_features = []
    for feat in base_features:
        chunk_ics = []
        for i in range(3):
            chunk_dates = dates[i * chunk_size:(i + 1) * chunk_size]
            chunk = train_df[row_dates.isin(chunk_dates)]
            # Drop missing pairs instead of filling with 0, which would distort the ranks
            pairs = chunk[[feat, 'ret20']].dropna()
            ic, _ = spearmanr(pairs[feat], pairs['ret20'])
            chunk_ics.append(ic)

        # Feature is stable if it has positive IC in 2 out of 3 chunks
        if sum(1 for ic in chunk_ics if ic > 0) >= 2:
            stable_features.append(feat)

    print(f"Stable features: {len(stable_features)} / {len(base_features)}")
    return stable_features
```
---
## LightGBM Configuration
```python
import lightgbm as lgb
import pandas as pd
from scipy.stats import spearmanr

def train_lgbm_ranker(train_df, feature_cols, target_col='ret20'):
    """
    LightGBM configured as a ranking model: directly optimizes
    for cross-sectional ranking rather than return magnitude.
    """
    # LambdaRank expects all rows of one query to be contiguous,
    # so sort by date before computing group sizes.
    train_df = train_df.sort_index(level='date')

    # Group data by date for LambdaRank.
    # Each date is a "query": rank stocks within each date.
    groups = train_df.groupby(level='date').size().values

    params = {
        'objective': 'lambdarank',
        'metric': 'ndcg',
        # Other hyperparameters are not listed in this part; LightGBM defaults apply here.
    }

    X_train = train_df[feature_cols].values

    # For lambdarank, labels should be integer relevance scores.
    # We convert returns to quintile ranks (0-4) as relevance;
    # rows with a missing target get the middle bucket.
    y_train = pd.qcut(train_df[target_col], q=5, labels=False).fillna(2).astype(int)

    model = lgb.LGBMRanker(**params)
    model.fit(
        X_train, y_train.values,
        group=groups,
        eval_set=[(X_train, y_train.values)],  # Will be replaced with val in practice
        eval_group=[groups],
    )
    return model
```
### Why LambdaRank instead of MSE?
As noted in Part 4, MSE loss cares about prediction magnitude — how close is 0.05% predicted to 0.06% actual. LambdaRank directly optimizes the ranking quality (NDCG — Normalized Discounted Cumulative Gain). This aligns the training objective with what we actually care about: ranking stocks by expected return, not predicting exact magnitudes.
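A toy example makes the mismatch visible. The sketch below uses Spearman rank correlation as a simple stand-in for ranking quality (it is not NDCG, and the numbers are invented for illustration): one set of predictions is close in magnitude but ordered backwards, the other is badly scaled but ordered correctly.

```python
import numpy as np

def mse(pred, actual):
    """Mean squared error between predicted and realized returns."""
    return float(np.mean((pred - actual) ** 2))

def rank_ic(pred, actual):
    """Spearman correlation via ranks (assumes no ties in either input)."""
    pred_ranks = np.argsort(np.argsort(pred))
    actual_ranks = np.argsort(np.argsort(actual))
    return float(np.corrcoef(pred_ranks, actual_ranks)[0, 1])

# Realized 20-day returns for five stocks on one date (illustrative values)
actual = np.array([-0.02, -0.01, 0.00, 0.01, 0.02])

# Model A: tiny predictions, ordered exactly backwards
pred_a = np.array([0.002, 0.001, 0.000, -0.001, -0.002])
# Model B: magnitudes 5x too large, but ordered correctly
pred_b = np.array([-0.10, -0.05, 0.00, 0.05, 0.10])

mse_a, mse_b = mse(pred_a, actual), mse(pred_b, actual)
ic_a, ic_b = rank_ic(pred_a, actual), rank_ic(pred_b, actual)
```

Here `mse_a` is smaller than `mse_b`, so an MSE objective prefers model A, yet A ranks every stock in the wrong order while B ranks them perfectly. A long-short portfolio built from A would lose money on this date; one built from B would not.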
## The False Alpha Catch: MLP / FEATURE_SET_B
The MLP model trained on FEATURE_SET_B showed Rank IC of approximately +0.18 on validation data. This should have been immediately suspicious — +0.18 is roughly 2x the typical IC of production quant models at top-tier firms.
### The adversarial test
We ran a specific adversarial test: train the model using scrambled (randomly permuted) targets, and see if the IC remains high. If the model is genuinely learning the relationship between features and returns, IC should collapse to near-zero with shuffled targets. It didn't.
```python
import numpy as np

def adversarial_ic_test(model_fn, train_df, val_df, feature_cols, n_permutations=10):
    """
    Shuffle target labels and check if model still achieves high IC.
    If IC remains high with shuffled targets, the result is from leakage.

    `evaluate_rank_ic` is expected to be defined elsewhere in the project.
    """
    # Real IC
    real_model = model_fn(train_df, feature_cols)
    real_ic = evaluate_rank_ic(real_model, val_df, feature_cols)
    print(f"Real IC: {real_ic:.4f}")

    # Shuffled IC distribution
    shuffled_ics = []
    for i in range(n_permutations):
        rng = np.random.default_rng(i)  # seeded so each run is reproducible
        shuffled_train = train_df.copy()
        # Randomly permute targets within each date
        shuffled_train['ret20'] = (
            shuffled_train.groupby(level='date')['ret20']
            .transform(lambda x: rng.permutation(x.values))
        )
        shuffled_model = model_fn(shuffled_train, feature_cols)
        ic = evaluate_rank_ic(shuffled_model, val_df, feature_cols)
        shuffled_ics.append(ic)

    print(f"Mean shuffled IC: {np.mean(shuffled_ics):.4f}")
    print(f"Std shuffled IC: {np.std(shuffled_ics):.4f}")
    print(f"Signal-to-noise: {real_ic / np.std(shuffled_ics):.2f}")

    # If shuffled IC is high, validation data has been contaminated
    if np.mean(shuffled_ics) > 0.05:
        print("WARNING: High IC with shuffled targets; likely leakage")

    return real_ic, shuffled_ics
```
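The test above calls `evaluate_rank_ic`, which is not shown in this part. A minimal sketch of what such a helper might look like, assuming the model exposes `predict` and that Rank IC means the average of per-date Spearman correlations; the `FirstColumnModel` class and the toy frame are hypothetical and exist only to exercise the function.

```python
import numpy as np
import pandas as pd

def evaluate_rank_ic(model, val_df, feature_cols, target_col="ret20"):
    """Mean per-date Spearman correlation between predictions and realized returns."""
    scored = val_df[[target_col]].copy()
    scored["pred"] = model.predict(val_df[feature_cols].values)
    scored = scored.dropna()

    daily_ics = []
    for _, day in scored.groupby(level="date"):
        if len(day) < 3:
            continue  # too few names to rank meaningfully
        # Spearman = Pearson correlation of the ranks
        daily_ics.append(day["pred"].rank().corr(day[target_col].rank()))
    return float(np.nanmean(daily_ics)) if daily_ics else float("nan")

class FirstColumnModel:
    """Hypothetical stand-in model: predicts the first feature column as-is."""
    def predict(self, X):
        return np.asarray(X)[:, 0]

idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2024-01-02", "2024-01-03"]), ["A", "B", "C", "D"]],
    names=["date", "ticker"],
)
val_df = pd.DataFrame(
    {
        # Signal ranks stocks correctly on the first date, backwards on the second
        "signal": [1, 2, 3, 4, 4, 3, 2, 1],
        "ret20": [0.01, 0.02, 0.03, 0.04, 0.01, 0.02, 0.03, 0.04],
    },
    index=idx,
)
ic = evaluate_rank_ic(FirstColumnModel(), val_df, ["signal"])
```

Computing the correlation date by date, then averaging, keeps a single unusual day from dominating and matches the cross-sectional question the model is meant to answer: on a given date, which stocks rank higher?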
Result: MLP / FEATURE_SET_B maintained IC ≈ +0.12 even with completely randomized target labels. This is the definitive fingerprint of look-ahead leakage — the model is extracting information from the validation-period features that was baked in during the (contaminated) normalization step. Rejected.
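The contaminated normalization code itself is not shown here, so the following is only a sketch of the general failure mode, using synthetic data: scaling statistics fitted on train plus validation rows push information about the validation period into the training features, while statistics fitted on the training rows alone do not.

```python
import numpy as np

def fit_scaler(X):
    """Return per-feature mean and standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return mu, np.where(sigma == 0, 1.0, sigma)

def apply_scaler(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
train_X = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
# Synthetic validation period whose feature levels have shifted
val_X = rng.normal(loc=2.0, scale=1.0, size=(100, 3))

# Clean: statistics come from the training window only
mu, sigma = fit_scaler(train_X)
train_clean = apply_scaler(train_X, mu, sigma)
val_clean = apply_scaler(val_X, mu, sigma)

# Contaminated: statistics are fitted on train and validation together
mu_leak, sigma_leak = fit_scaler(np.vstack([train_X, val_X]))
train_leaky = apply_scaler(train_X, mu_leak, sigma_leak)
```

With clean scaling, `train_clean` is centered on zero. With contaminated scaling, `train_leaky` is shifted below zero because the pooled mean already reflects the validation period's higher feature levels; the training inputs now depend on data from the future.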
### The right mindset about rejection
Rejecting MLP/FEATURE_SET_B was not a failure. It was the validation process working correctly. Every false positive caught before production is thousands of dollars (or more) in losses avoided. Document the failure mode. Move on with the clean candidate.
## Step 7G: The Clean Candidate Validation
After identifying LightGBM / FEATURE_SET_D_INVERTED as the clean candidate, we ran the full Step 7G validation protocol:
- Rebuild feature selection inside each training fold (no test.parquet access)
- Train LightGBM on FEATURE_SET_D_INVERTED with correct normalization
- Walk-forward 4-fold validation with 21-day purge gaps
- 2026 forward OOS test on held-out data
- Date-level Rank IC computation
- Quintile spread analysis
- Portfolio simulation with transaction costs
- Comparison vs. momentum baseline, equal-weight, random
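The walk-forward step in the list above can be sketched as a date splitter. This is a minimal version under stated assumptions: an expanding training window, equal-sized test blocks, and a purge measured in trading dates. The 21-date gap is at least as long as the 20-day target horizon, so no training label overlaps the test window. The function name `purged_walk_forward` is hypothetical.

```python
import numpy as np

def purged_walk_forward(dates, n_folds=4, purge_days=21):
    """Yield (train_dates, test_dates) pairs with a purge gap before each test block.

    Assumes an expanding training window and at least n_folds + 1 unique dates.
    """
    unique_dates = np.unique(np.asarray(dates))
    # The first block is training-only; each later block serves once as a test fold
    blocks = np.array_split(np.arange(len(unique_dates)), n_folds + 1)
    for test_idx in blocks[1:]:
        train_end = test_idx[0] - purge_days  # exclusive end of the training window
        if train_end <= 0:
            continue  # not enough history before this test block
        yield unique_dates[:train_end], unique_dates[test_idx]

# Toy calendar: 500 trading days numbered 0..499
folds = list(purged_walk_forward(np.arange(500), n_folds=4, purge_days=21))
for train_dates, test_dates in folds:
    print(f"train 0..{train_dates[-1]}  |  test {test_dates[0]}..{test_dates[-1]}")
```

Feature selection and normalization would then be refitted inside each fold on `train_dates` only, which is what the first two items of the protocol require.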
### Results summary
The clean candidate is real but economically insufficient at high turnover. The path forward is strategy architecture — Part 8.
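The arithmetic behind "insufficient at high turnover" is simple to sketch. The inputs below are hypothetical, not results from this project, and the model assumes linear costs and the same gross return at every rebalance frequency, which is a simplification.

```python
def net_annual_return(gross_annual, traded_fraction, rebalances_per_year, cost_bps):
    """Gross annual return minus linear trading costs.

    traded_fraction: notional traded per rebalance (buys plus sells)
                     as a fraction of portfolio value.
    cost_bps:        all-in cost per unit of notional traded, in basis points.
    """
    annual_cost = traded_fraction * (cost_bps / 1e4) * rebalances_per_year
    return gross_annual - annual_cost

# Hypothetical inputs: 8% gross, 160% of the book traded per rebalance, 10 bps cost
weekly = net_annual_return(0.08, traded_fraction=1.6, rebalances_per_year=52, cost_bps=10)
monthly = net_annual_return(0.08, traded_fraction=1.6, rebalances_per_year=12, cost_bps=10)
print(f"weekly rebalance:  {weekly:+.4f}")
print(f"monthly rebalance: {monthly:+.4f}")
```

Under these made-up numbers the weekly version gives back its entire gross return to costs while the monthly version keeps most of it. The same signal can be worthless or useful depending on how often the strategy trades on it.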
## Feature Importance: What Drove the Signal
```python
import pandas as pd

def analyze_feature_importance(model, feature_cols):
    """Extract and print LightGBM gain-based feature importance."""
    importance_df = pd.DataFrame({
        'feature': feature_cols,
        # feature_importances_ defaults to split counts, so request gain explicitly
        'importance_gain': model.booster_.feature_importance(importance_type='gain'),
    }).sort_values('importance_gain', ascending=False)
    print(importance_df.head(15).to_string(index=False))
    return importance_df
```
Top features by information gain (from the clean model):
| Rank | Feature | Category | Relative importance |
|---|---|---|---|
| 1 | iv_hv_ratio | Options | 100% |
| 2 | alpha_vs_sector | Sector | 81% |
| 3 | hvol_5d / hvol_60d ratio | Volatility | 74% |
| 4 | price_vs_sma_200 | Trend | 68% |
| 5 | sentiment_7d_ma | Sentiment | 57% |
| 6 | obv_trend_10d | Volume | 52% |
| 7 | price_vs_sma_50 | Trend | 48% |
| 8 | put_call_ratio | Options | 43% |
| 9 | sector_vol | Sector | 38% |
| 10 | earnings_season | Seasonality | 31% |
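The "Relative importance" column reads as each feature's gain scaled so the top feature equals 100%. How the column was actually computed is not stated, so this is a sketch of that assumed normalization, with invented raw gain values.

```python
import numpy as np

def to_relative_importance(raw_gain):
    """Scale raw gain values so the largest equals 100 (rounded to whole percent)."""
    raw = np.asarray(raw_gain, dtype=float)
    return np.round(100.0 * raw / raw.max()).astype(int)

# Hypothetical raw gains for three features
relative = to_relative_importance([250.0, 202.5, 185.0])
```

Relative gain only compares features within one fitted model. It does not say whether a feature's effect is positive or negative, and it tends to be split across correlated features such as the several volatility measures in this set.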
The IV/HV ratio dominance is consistent with prior research: implied volatility premium is one of the most reliably exploitable signals in equity options markets. The model independently discovered what options researchers have documented for decades.
## Where This Project Stands
At the end of Part 7, the project is in this state:
- ✅ Clean predictive signal exists (LightGBM / FEATURE_SET_D_INVERTED, IC +0.08)
- ✅ Signal survived adversarial leakage audit
- ✅ Signal survived walk-forward validation (3/4 folds positive)
- ⚠️ Signal not currently economically viable at high turnover after costs
- ⚠️ No regime-gating applied yet (IC varies significantly by market regime)
- 🔲 No strategy mapping to specific trade structures
- 🔲 No live paper-trading validation
Part 8 is the architecture that converts the statistical signal into a tradable system.
Series — Building a Quantitative Trading System