Why LightGBM?

After we invested significant effort in the TFT, discovering that LightGBM, a gradient boosting tree model, produced comparable or better clean Rank IC on this dataset was initially surprising. It shouldn't have been.

The tabular data advantage

Gradient Boosting

An ensemble method that builds a sequence of weak learners (typically shallow decision trees), where each new tree corrects the errors of the previous ensemble. LightGBM (Light Gradient Boosting Machine) is a highly optimized implementation that uses histogram-based binning and leaf-wise tree growth to achieve high speed and accuracy on large tabular datasets.

For cross-sectional financial prediction, the input is fundamentally a table: each row is a (date, ticker) observation with ~30 engineered features. This is structured tabular data. On this kind of data, tree-based models have several structural advantages over deep learning:

Property	LightGBM	Neural Networks (TFT, ST)
Feature interactions	Automatic (splits)	Requires large capacity + data
Handling non-linearity	Excellent (tree splits)	Good (but needs regularization)
Robustness to outliers	Very good (rank-based splits)	Sensitive (unless carefully handled)
Feature scaling needed?	No	Yes (normalized inputs)
Overfitting tendency	Low (with proper params)	Higher on small datasets
Training time	Minutes	Hours (TFT)
Interpretability	Feature importance	Partial (gate weights in TFT)

The key insight from the literature and this project: transformers excel at sequence modeling when the temporal structure of the sequence itself is what carries information (language, audio, video). For cross-sectional equity ranking, the temporal structure is partially encoded in the engineered features (moving averages, momentum, volatility). Once those features exist, the prediction problem becomes "given these 30 numbers about this stock today, predict its 20-day return." That's a tabular problem, and trees are very good at it.

Feature Sets: The Experimental Framework

Rather than training one model with all features, we tested multiple feature sets systematically. Each "feature set" is a hypothesis about which signals contain predictive information.

| Feature Set | Description | Result |
|---|---|---|
| FEATURE_SET_A | Basic OHLCV + simple MAs | IC ≈ 0.02, collapsed |
| FEATURE_SET_B | A + sentiment + options | IC ≈ 0.18 (leaked) |
| FEATURE_SET_C | B but with correct lag | IC ≈ 0.04, marginal |
| FEATURE_SET_D | Full set, proper construction | IC ≈ 0.06 |
| FEATURE_SET_D_INVERTED | D with feature selection inside folds | IC ≈ 0.08, clean |

What is FEATURE_SET_D_INVERTED?

"Inverted" refers to a key construction choice: instead of selecting features that are most correlated with returns (forward-looking), we select features based on their temporal stability within the training window — features that are consistently informative across the training period, not just correlated with the specific test period.

This inverts the typical selection criterion and reduces the risk of selecting features that happen to correlate with the validation period by chance.

from scipy.stats import spearmanr

def build_feature_set_d_inverted(train_df, val_df):
    """
    Feature set D with inverted selection criterion.
    Selection criterion: temporal consistency within training window,
    not raw correlation with test targets.
    """
    # Note: val_df is accepted but never read; selection uses train_df only.
    base_features = [
        # Trend
        'price_vs_sma_7', 'price_vs_sma_14', 'price_vs_sma_50', 'price_vs_sma_200',
        # Momentum
        'rsi_7', 'rsi_14',
        # Volume
        'vol_ratio_5d', 'vol_ratio_20d', 'obv_trend_10d',
        # Volatility
        'hvol_5d', 'hvol_20d', 'hvol_60d', 'vol_ratio', 'daily_range_vs_avg',
        # Options
        'iv_30d', 'iv_hv_ratio', 'put_call_ratio', 'skew_25d',
        # Sentiment
        'sentiment_score', 'sentiment_7d_ma',
        # Sector
        'sector_avg_ret', 'alpha_vs_sector', 'sector_vol',
        # Seasonality
        'month_sin', 'month_cos', 'earnings_season',
    ]

    # Split training data into 3 temporal chunks
    # A feature is "stable" if it has positive IC in at least 2/3 chunks
    date_index = train_df.index.get_level_values('date')
    dates = date_index.unique().sort_values()
    chunk_size = len(dates) // 3

    stable_features = []

    for feat in base_features:
        chunk_ics = []
        for i in range(3):
            chunk_dates = dates[i*chunk_size : (i+1)*chunk_size]
            chunk = train_df[date_index.isin(chunk_dates)]

            # Drop rows where either value is missing; filling NaN with 0
            # would add artificial ties and distort the rank correlation
            pair = chunk[[feat, 'ret20']].dropna()
            ic, _ = spearmanr(pair[feat], pair['ret20'])
            chunk_ics.append(ic)

        # Feature is stable if positive IC in 2 out of 3 chunks
        # (a NaN IC, e.g. from a constant feature, does not count as positive)
        if sum(1 for ic in chunk_ics if ic > 0) >= 2:
            stable_features.append(feat)

    print(f"Stable features: {len(stable_features)} / {len(base_features)}")
    return stable_features

---

## LightGBM Configuration

import lightgbm as lgb
import pandas as pd
from scipy.stats import spearmanr

def train_lgbm_ranker(train_df, feature_cols, target_col='ret20'):
    """
    LightGBM configured as a ranking model: directly optimizes
    for cross-sectional ranking rather than return magnitude.
    """
    # LambdaRank expects the rows of each query to be contiguous and in the
    # same order as the group sizes, so sort by date first
    train_df = train_df.sort_index(level='date')

    # Group data by date for LambdaRank
    # Each date is a "query": rank stocks within each date
    groups = train_df.groupby(level='date').size().values

    params = {
        'objective': 'lambdarank',       # Rank optimization
        'metric': 'ndcg',                # Ranking quality metric
        'ndcg_eval_at': [5, 10, 20],     # Evaluate NDCG at top 5, 10, 20
        'learning_rate': 0.05,           # Conservative LR for stability
        'num_leaves': 31,                # Controls tree complexity
        'max_depth': 6,                  # Prevents very deep trees
        'min_child_samples': 50,         # Prevents overfitting on small groups
        'subsample': 0.8,                # Row subsampling (bagging); only applied when subsample_freq > 0
        'colsample_bytree': 0.8,         # Feature subsampling
        'reg_alpha': 0.1,                # L1 regularization
        'reg_lambda': 1.0,               # L2 regularization
        'n_estimators': 500,             # Number of trees
        'early_stopping_rounds': 50,     # Stop if no improvement
        'verbose': -1,
    }

    X_train = train_df[feature_cols].values

    # For lambdarank, labels should be relevance scores
    # We convert returns to quintile ranks (0-4) as relevance.
    # These quintiles are pooled over the whole training set, not computed
    # per date; rows with a missing target get the middle bucket (2)
    y_train = pd.qcut(train_df[target_col], q=5, labels=False).fillna(2).astype(int)

    model = lgb.LGBMRanker(**params)

    model.fit(
        X_train, y_train.values,
        group=groups,
        # Placeholder: evaluating on the training set makes early stopping
        # meaningless; will be replaced with val in practice
        eval_set=[(X_train, y_train.values)],
        eval_group=[groups],
    )
    return model
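One detail worth a closer look: the relevance labels in the function above are quintiles of the pooled training targets, while LambdaRank only compares stocks within the same date. On a strong market day most stocks could land in the top bucket and the ranker would see few pairs to learn from. A possible alternative, sketched here as an assumption rather than what the project did, is to bucket returns within each date. The sketch uses only the standard library and made-up rows; `quintile_labels` and `per_date_labels` are hypothetical names, and in a pandas pipeline the same idea would be a per-date groupby.

```python
from collections import defaultdict

def quintile_labels(returns):
    """Map one date's returns to relevance buckets 0 (worst) .. 4 (best) by rank."""
    n = len(returns)
    order = sorted(range(n), key=lambda i: returns[i])  # indices, lowest return first
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // n  # rank < n, so the result is always in 0..4
    return labels

def per_date_labels(rows):
    """rows: list of (date, ticker, ret20). Returns {(date, ticker): label}."""
    by_date = defaultdict(list)
    for date, ticker, ret in rows:
        by_date[date].append((ticker, ret))
    out = {}
    for date, items in by_date.items():
        labels = quintile_labels([ret for _, ret in items])
        for (ticker, _), label in zip(items, labels):
            out[(date, ticker)] = label
    return out

# Made-up data: one day where every return is high, one where every return is negative
rows = [
    ("2024-01-02", "A", 0.05), ("2024-01-02", "B", 0.06), ("2024-01-02", "C", 0.07),
    ("2024-01-02", "D", 0.08), ("2024-01-02", "E", 0.09),
    ("2024-01-03", "A", -0.09), ("2024-01-03", "B", -0.08), ("2024-01-03", "C", -0.07),
    ("2024-01-03", "D", -0.06), ("2024-01-03", "E", -0.05),
]
labels = per_date_labels(rows)
```

With pooled quintiles, every stock on the strong day would sit in the upper buckets and every stock on the weak day in the lower ones. With per-date buckets, both days use the full 0 to 4 range, so each date contributes ranking pairs.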

Why LambdaRank instead of MSE?

As noted in Part 4, MSE loss cares about prediction magnitude — how close is 0.05% predicted to 0.06% actual. LambdaRank directly optimizes the ranking quality (NDCG — Normalized Discounted Cumulative Gain). This aligns the training objective with what we actually care about: ranking stocks by expected return, not predicting exact magnitudes.
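To make the difference concrete, here is a small sketch of NDCG for a single date. It is not the project's code: the `2^rel - 1` gain and `log2` position discount are the common form of the metric, and the scores and relevance labels are invented for illustration.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain of the first k items, with gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(scores, relevances, k):
    """NDCG@k for one query: order items by predicted score, compare to the ideal order."""
    ranked = [rel for _, rel in sorted(zip(scores, relevances), key=lambda p: -p[0])]
    ideal = sorted(relevances, reverse=True)
    best = dcg(ideal, k)
    return dcg(ranked, k) / best if best > 0 else 0.0

relevances = [4, 3, 2, 1, 0]  # true quintile of each stock on this date

# Correct order, badly wrong magnitudes: a large MSE, but a perfect ranking
scores_wrong_magnitude = [9.0, 0.4, 0.3, 0.2, 0.1]

# Magnitudes close to each other, but the top two stocks are swapped
scores_top_swapped = [0.051, 0.052, 0.03, 0.02, 0.01]

perfect = ndcg_at_k(scores_wrong_magnitude, relevances, k=5)
swapped = ndcg_at_k(scores_top_swapped, relevances, k=5)
```

The first set of scores gets a perfect NDCG even though the predicted magnitudes are far off, while the second is penalized for a swap at the top of the list, where the discount is smallest. A ranking objective rewards exactly that behavior.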

The False Alpha Catch: MLP / FEATURE_SET_B

The MLP model trained on FEATURE_SET_B showed Rank IC of approximately +0.18 on validation data. This should have been immediately suspicious — +0.18 is roughly 2x the typical IC of production quant models at top-tier firms.

The adversarial test

We ran a specific adversarial test: train the model using scrambled (randomly permuted) targets, and see if the IC remains high. If the model is genuinely learning the relationship between features and returns, IC should collapse to near-zero with shuffled targets. It didn't.

import numpy as np

def adversarial_ic_test(model_fn, train_df, val_df, feature_cols, n_permutations=10):
    """
    Shuffle target labels and check if model still achieves high IC.
    If IC remains high with shuffled targets, the result is from leakage.
    """
    # Real IC (evaluate_rank_ic is a project helper not shown in this part)
    real_model = model_fn(train_df, feature_cols)
    real_ic = evaluate_rank_ic(real_model, val_df, feature_cols)
    print(f"Real IC: {real_ic:.4f}")

    # Shuffled IC distribution
    shuffled_ics = []
    for i in range(n_permutations):
        shuffled_train = train_df.copy()

        # Randomly permute targets within each date
        shuffled_train['ret20'] = (
            shuffled_train.groupby(level='date')['ret20']
            .transform(lambda x: x.sample(frac=1).values)
        )

        shuffled_model = model_fn(shuffled_train, feature_cols)
        ic = evaluate_rank_ic(shuffled_model, val_df, feature_cols)
        shuffled_ics.append(ic)

    print(f"Mean shuffled IC: {np.mean(shuffled_ics):.4f}")
    print(f"Std shuffled IC: {np.std(shuffled_ics):.4f}")
    print(f"Signal-to-noise: {real_ic / np.std(shuffled_ics):.2f}")

    # If shuffled_ic is high, validation data has been contaminated
    if np.mean(shuffled_ics) > 0.05:
        print("WARNING: High IC with shuffled targets, likely leakage")
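The test relies on an `evaluate_rank_ic` helper that is not shown in this part. As a rough sketch of the computation it presumably wraps, the function below takes flat lists of dates, predictions, and targets, computes a Spearman correlation per date, and averages them. The name `date_level_rank_ic`, its signature, and the sample data are assumptions for illustration; the real helper takes a model and a DataFrame.

```python
from collections import defaultdict
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (vx * vy) ** 0.5

def date_level_rank_ic(dates, preds, targets):
    """Mean of the per-date Spearman correlations between predictions and targets."""
    by_date = defaultdict(lambda: ([], []))
    for d, p, t in zip(dates, preds, targets):
        by_date[d][0].append(p)
        by_date[d][1].append(t)
    ics = [spearman(p, t) for p, t in by_date.values() if len(p) >= 2]
    ics = [ic for ic in ics if ic == ic]  # drop NaN values
    return mean(ics) if ics else float("nan")

# Day 1 is ranked perfectly, day 2 is ranked exactly backwards
dates = ["d1"] * 4 + ["d2"] * 4
preds = [1, 2, 3, 4, 1, 2, 3, 4]
targets = [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1]
ic = date_level_rank_ic(dates, preds, targets)
```

Averaging per date matters: one perfect day and one inverted day cancel out, which is the behavior we want from a cross-sectional metric. Pooling all rows into one correlation would mix cross-sectional skill with market-wide moves.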

Result: MLP / FEATURE_SET_B maintained IC ≈ +0.12 even with completely randomized target labels. This is the definitive fingerprint of look-ahead leakage — the model is extracting information from the validation-period features that was baked in during the (contaminated) normalization step. Rejected.

The right mindset about rejection

Rejecting MLP/FEATURE_SET_B was not a failure. It was the validation process working correctly. Every false positive caught before production is thousands of dollars (or more) in losses avoided. Document the failure mode. Move on with the clean candidate.

Step 7G: The Clean Candidate Validation

After identifying LightGBM / FEATURE_SET_D_INVERTED as the clean candidate, we ran the full Step 7G validation protocol:

1. Rebuild feature selection inside each training fold (no test.parquet access)
2. Train LightGBM on FEATURE_SET_D_INVERTED with correct normalization
3. Walk-forward 4-fold validation with 21-day purge gaps
4. 2026 forward OOS test on held-out data
5. Date-level Rank IC computation
6. Quintile spread analysis
7. Portfolio simulation with transaction costs
8. Comparison vs. momentum baseline, equal-weight, random
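The walk-forward step with purge gaps can be sketched as follows. This is a minimal illustration, not the project's splitter: it assumes an expanding training window and equal-sized test blocks, and `purged_walk_forward` is a hypothetical name. The purge exists because a 20-day forward return labelled on date t depends on prices up to t+20, so training rows just before a test block would otherwise carry labels that overlap the test period.

```python
def purged_walk_forward(n_dates, n_folds=4, purge=21):
    """Return a list of (train_range, test_range) over date positions 0..n_dates-1.

    Assumption: expanding window. Fold k trains on every date before its
    test block, minus the last `purge` dates.
    """
    block = n_dates // (n_folds + 1)  # first block is train-only
    if block <= purge:
        raise ValueError("not enough dates for this many folds and purge length")
    splits = []
    for k in range(1, n_folds + 1):
        test_start = k * block
        # The last fold absorbs any leftover dates
        test_end = n_dates if k == n_folds else (k + 1) * block
        train_end = test_start - purge  # drop the dates right before the test block
        splits.append((range(0, train_end), range(test_start, test_end)))
    return splits

splits = purged_walk_forward(n_dates=1000, n_folds=4, purge=21)
for train, test in splits:
    print(f"train [0, {train.stop})  gap {test.start - train.stop}  test [{test.start}, {test.stop})")
```

The ranges index into the sorted list of unique dates, not into rows, so every ticker on a given date lands on the same side of the split.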

Results summary

| Metric | LightGBM/D_INV | Momentum baseline | Equal-weight |
|---|---|---|---|
| Mean Rank IC | +0.0801 | +0.043 | ≈ 0 |
| IC t-stat | 2.3 | 1.1 | — |
| Positive IC folds | 3 / 4 | 2 / 4 | — |
| Gross return (sim) | +5–8 bps/trade | +3–5 bps/trade | 0 |
| After costs (–15 bps round trip) | –2 to –3 bps | –8 bps | 0 |

The clean candidate is real but economically insufficient at high turnover. The path forward is strategy architecture — Part 8.
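For readers unfamiliar with the IC t-stat row in the table above, the usual definition is the mean IC divided by its standard error. The sketch below shows that formula only. This part does not state whether the reported figure was computed over daily ICs or fold-level ICs, and the sample series is invented.

```python
from math import sqrt
from statistics import mean, stdev

def ic_t_stat(ics):
    """t-statistic of the mean IC against zero: mean / (sample std / sqrt(n))."""
    n = len(ics)
    return mean(ics) / (stdev(ics) / sqrt(n))

# Made-up per-period IC values, for illustration only
sample_ics = [0.10, -0.02, 0.15, 0.05, 0.12, -0.04, 0.09, 0.11]
t = ic_t_stat(sample_ics)
```

One caveat applies when the inputs are daily ICs on 20-day forward returns: consecutive targets overlap, so the daily ICs are serially correlated and this simple formula tends to overstate significance.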

Feature Importance: What Drove the Signal

def analyze_feature_importance(model, feature_cols):
    """Extract and print LightGBM feature importance (by gain)."""
    importance_df = pd.DataFrame({
        'feature': feature_cols,
        # feature_importances_ defaults to split counts, so ask the
        # underlying booster for gain explicitly
        'importance_gain': model.booster_.feature_importance(importance_type='gain'),
    }).sort_values('importance_gain', ascending=False)

    print(importance_df.head(15).to_string(index=False))
    return importance_df

Top features by information gain (from the clean model):

Rank	Feature	Category	Relative importance
1	iv_hv_ratio	Options	100%
2	alpha_vs_sector	Sector	81%
3	hvol_5d / hvol_60d ratio	Volatility	74%
4	price_vs_sma_200	Trend	68%
5	sentiment_7d_ma	Sentiment	57%
6	obv_trend_10d	Volume	52%
7	price_vs_sma_50	Trend	48%
8	put_call_ratio	Options	43%
9	sector_vol	Sector	38%
10	earnings_season	Seasonality	31%
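A relative importance column like the one above can be derived by scaling raw gains so the top feature reads 100%. The sketch below shows that normalization with invented feature names and gain values; it is an assumption about how the column was produced, not the project's code.

```python
def relative_importance(gains):
    """Scale raw importances so the largest one becomes 100 (percent)."""
    top = max(gains.values())
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, round(100 * gain / top)) for name, gain in ranked]

# Hypothetical raw gain values, not taken from the model
raw_gains = {"feat_b": 200.0, "feat_a": 250.0, "feat_c": 50.0}
table = relative_importance(raw_gains)
```

Scaled this way, the numbers only compare features within one model. They say nothing about how much total signal the model has.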

The IV/HV ratio dominance is consistent with prior research: implied volatility premium is one of the most reliably exploitable signals in equity options markets. The model independently discovered what options researchers have documented for decades.

Where This Project Stands

At the end of Part 7, the project is in this state:

✅ Clean predictive signal exists (LightGBM / FEATURE_SET_D_INVERTED, IC +0.08)
✅ Signal survived adversarial leakage audit
✅ Signal survived walk-forward validation (3/4 folds positive)
⚠️ Signal not currently economically viable at high turnover after costs
⚠️ No regime-gating applied yet (IC varies significantly by market regime)
🔲 No strategy mapping to specific trade structures
🔲 No live paper-trading validation

Part 8 is the architecture that converts the statistical signal into a tradable system.
