Cross-Asset TFT Model: Mathematical Walkthrough and Data Flow
Date: 2026-04-23
This note explains how the repository's cross-asset Temporal Fusion Transformer-style model is formulated, how the data is sanitized, what one row looks like mathematically, how rows are assembled into a window, and how that window is processed by the network.
The implementation lives in:
- model/tft/config.py
- model/tft/dataset.py
- model/tft/model.py
- model/tft/trainer.py
- model/tft/predict.py
1. Model Purpose
The model forecasts future log returns for one target ticker using both the target's own history and a cross-asset context basket. With the default configuration, the target is combined with 19 context ETFs/proxies:
- Equities: SPY, QQQ, IWM, DIA
- Rates: TLT, IEF, SHY
- Credit: HYG, LQD
- Commodities: GLD, SLV, USO, UNG
- FX and dollar proxies: UUP, FXE, FXY, FXB, FXA
- Volatility: VXX
The model predicts three horizons:
- next day
- next week, about 5 trading days
- one month, about 21 trading days
This is a multi-horizon regression model, not a classifier.
2. Tensor Dimensions
Let:
- A = number of assets in the input basket
- F = number of engineered features per asset
- L = sequence length
- d = transformer hidden width
In the default config:
- A = 20
- F = 6
- L = 252
- d = 128
So the per-timestep feature width is
D = A * F = 20 * 6 = 120
and one model input sample has shape
X in R^(252 x 120)
At inference time the batch dimension is added:
X_batch in R^(1 x 252 x 120)
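The shape arithmetic above can be checked in a few lines. This is only a bookkeeping sketch using the default sizes quoted in this section; the variable names are illustrative and are not taken from the repository.

```python
# Default sizes quoted in this note.
A, F, L, d = 20, 6, 252, 128

D = A * F                    # per-timestep feature width
sample_shape = (L, D)        # one input window X
batch_shape = (1, L, D)      # inference adds a leading batch axis
hidden_shape = (L, d)        # width of the sequence inside the transformer

print(D, sample_shape, batch_shape, hidden_shape)
```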
3. Raw Database Source
The input starts from market_data_daily, where the dataset loader reads only:
- ticker
- date
- close
- volume
For a target ticker j*, the code:
- loads all chosen tickers
- pivots close and volume into date x ticker matrices
- keeps only dates where the target exists
- forward-fills context tickers across the target's calendar
- computes rolling features per ticker
- drops rows with unresolved NaN
This behavior comes directly from model/tft/dataset.py.
4. Data Sanitization and Alignment
For each ticker j and date t, let:
- C_t^(j) = close
- V_t^(j) = volume
The loader sanitizes data in four important ways:
4.1 Calendar alignment
The target ticker defines the date index. Context assets are reindexed onto those target dates:
I_target = { t : C_t^(target) is observed }
Every other ticker is projected onto I_target.
4.2 Forward fill of context assets
If a context asset is missing on a target trading date, the last available observation is carried forward:
C_t^(j) <- C_(t')^(j) where t' < t is the most recent available date
This is a practical way to tolerate ETF holidays, delayed listings, and imperfect calendar overlap.
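The alignment and forward-fill rules in 4.1 and 4.2 can be sketched in plain Python. The helper name `align_and_ffill` and the toy prices are hypothetical. The sketch assumes that reindexing onto the target calendar happens before the fill, so context observations on non-target dates are ignored, and it uses None where nothing has been observed yet.

```python
def align_and_ffill(target_dates, context_obs):
    """Project one context series onto the target calendar, then forward-fill.

    target_dates: sorted ISO date strings on which the target has a close.
    context_obs:  dict of ISO date string -> close for one context ticker.
    """
    aligned, last = [], None
    for t in target_dates:
        if t in context_obs:          # observed on this target date
            last = context_obs[t]
        aligned.append(last)          # otherwise carry the last value forward
    return aligned

# Toy data (made up): the context ticker is missing on the 5th and the 7th.
target_dates = ["2026-01-05", "2026-01-06", "2026-01-07", "2026-01-08"]
context_obs = {"2026-01-06": 100.0, "2026-01-08": 101.5}
filled = align_and_ffill(target_dates, context_obs)
print(filled)
```

The leading None shows that a forward fill cannot invent a value before the first observation; such gaps are left for the later cleanup step to remove.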
4.3 Numerical cleanup
After feature engineering, the pipeline replaces +inf and -inf with NaN, then applies dropna().
So any row that still has unresolved rolling-window warmup gaps is removed before training or inference.
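A minimal stand-in for the cleanup step, assuming feature rows are held as plain lists of floats rather than in a DataFrame. The helper name `drop_non_finite_rows` is hypothetical; it treats +inf, -inf, and NaN the same way, which matches the effect of replacing infinities with NaN and then dropping rows that contain NaN.

```python
import math

def drop_non_finite_rows(rows):
    """Keep only rows in which every value is a finite number."""
    return [row for row in rows if all(math.isfinite(v) for v in row)]

rows = [
    [0.1, 0.2, 0.3],
    [float("nan"), 0.2, 0.3],    # rolling-window warmup gap
    [0.1, float("inf"), 0.3],    # e.g. a division blow-up
    [0.4, 0.5, 0.6],
]
clean = drop_non_finite_rows(rows)
print(clean)
```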
4.4 Positive-close safeguard
Any training anchor with last_close <= 0 is skipped before labels are created.
5. Feature Engineering Per Asset
For each asset j, the model constructs six features at each date t.
5.1 Price z-score over a 60-day rolling window
Let
mu_C,t^(j) = mean(C_(t-59:t)^(j))
sigma_C,t^(j) = std(C_(t-59:t)^(j))
Then
close_z_t^(j) = (C_t^(j) - mu_C,t^(j)) / (sigma_C,t^(j) + eps)
with eps = 1e-8.
5.2 Volume z-score over a 60-day rolling window
mu_V,t^(j) = mean(V_(t-59:t)^(j))
sigma_V,t^(j) = std(V_(t-59:t)^(j))
volume_z_t^(j) = (V_t^(j) - mu_V,t^(j)) / (sigma_V,t^(j) + eps)
5.3 One-day log return
ret1_t^(j) = log(C_t^(j) / C_(t-1)^(j))
5.4 Five-day log return
ret5_t^(j) = log(C_t^(j) / C_(t-5)^(j))
5.5 Twenty-day realized volatility of daily log returns
vol20_t^(j) = std(ret1_(t-19:t)^(j))
5.6 Distance from the 20-day moving average
Let
ma20_t^(j) = mean(C_(t-19:t)^(j))
Then
ma20_ratio_t^(j) = C_t^(j) / (ma20_t^(j) + eps) - 1
These are exactly the features named in model/tft/config.py.
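The six formulas can be written out for a single asset and a single date. This is a standard-library sketch, not the code in dataset.py. It assumes the sample standard deviation (ddof = 1), and the function name `asset_features` and the synthetic series are hypothetical.

```python
import math
import statistics

EPS = 1e-8

def asset_features(close, volume, t):
    """Six features for one asset at index t (needs t >= 59)."""
    c60 = close[t - 59 : t + 1]
    v60 = volume[t - 59 : t + 1]
    close_z = (close[t] - statistics.mean(c60)) / (statistics.stdev(c60) + EPS)
    volume_z = (volume[t] - statistics.mean(v60)) / (statistics.stdev(v60) + EPS)
    ret1 = math.log(close[t] / close[t - 1])
    ret5 = math.log(close[t] / close[t - 5])
    # The 20 daily log returns ending at t: ret1_(t-19) .. ret1_t.
    daily = [math.log(close[s] / close[s - 1]) for s in range(t - 19, t + 1)]
    vol20 = statistics.stdev(daily)
    ma20 = statistics.mean(close[t - 19 : t + 1])
    ma20_ratio = close[t] / (ma20 + EPS) - 1
    return [close_z, volume_z, ret1, ret5, vol20, ma20_ratio]

# Synthetic series: constant 0.1% daily growth and a repeating volume pattern.
close = [100.0 * 1.001 ** i for i in range(80)]
volume = [1_000_000.0 + 1_000.0 * (i % 7) for i in range(80)]
feats = asset_features(close, volume, t=79)
print(feats)
```

With constant growth, every daily log return is the same, so vol20 collapses to roughly zero and ret5 is five times ret1, which is a quick sanity check on the indexing.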
6. One Row: Cross-Asset Feature Concatenation
For a given date t, each asset contributes a 6-dimensional feature vector
f_t^(j) in R^6
The full cross-asset row is the concatenation
x_t = [f_t^(j1) || f_t^(j2) || ... || f_t^(jA)] in R^(120)
In this repository, the ordering is:
- target ticker first
- context tickers in the configured order
- within each ticker, feature order is
close_z, volume_z, ret1, ret5, vol20, ma20_ratio
So a single date is not sent into the transformer as a scalar or a price. It is sent as a 120-dimensional vector that already fuses target and context information.
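The ordering rules above amount to a nested loop: tickers in the outer loop, features in the inner one. The sketch below uses two tickers and made-up feature values; `build_row` is a hypothetical helper, not a function from the repository.

```python
FEATURE_ORDER = ["close_z", "volume_z", "ret1", "ret5", "vol20", "ma20_ratio"]

def build_row(features_by_ticker, target, context):
    """Concatenate per-asset features: target first, then context in order."""
    row = []
    for ticker in [target] + list(context):
        feats = features_by_ticker[ticker]
        row.extend(feats[name] for name in FEATURE_ORDER)
    return row

# Made-up values for illustration only.
features_by_ticker = {
    "AAPL": dict(zip(FEATURE_ORDER, [1.5, -2.1, 0.005, 0.04, 0.016, 0.05])),
    "SPY": dict(zip(FEATURE_ORDER, [1.7, -2.9, -0.001, 0.012, 0.011, 0.048])),
}
row = build_row(features_by_ticker, target="AAPL", context=["SPY"])
print(len(row), row)
```

With the default basket of 20 tickers, the same loop produces the 120-dimensional row described above.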
7. From Rows to a Training or Inference Window
The network never consumes one isolated row by itself. It consumes a length-252 sequence of rows:
X_t = [x_(t-251), x_(t-250), ..., x_t] in R^(252 x 120)
For training, each anchor date t produces one sample window and one 3-dimensional label vector.
For inference, the latest valid 252 rows are used.
For AAPL in the current database, the latest inference window spans:
- window_start = 2025-04-23
- window_end = 2026-04-23
- trading_days = 252
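Window assembly is plain slicing over the cleaned row sequence. The sketch below assumes that training anchors are limited to dates whose longest-horizon label still exists; the helper names and that restriction are illustrative rather than read from dataset.py.

```python
def training_windows(rows, seq_len, max_horizon):
    """Yield (anchor_index, window) pairs; each window ends at its anchor."""
    last_anchor = len(rows) - 1 - max_horizon   # label needs rows[t + max_horizon]
    for t in range(seq_len - 1, last_anchor + 1):
        yield t, rows[t - seq_len + 1 : t + 1]

def inference_window(rows, seq_len):
    """The latest seq_len rows, used for a live prediction."""
    return rows[-seq_len:]

rows = list(range(300))                  # stand-in for 300 feature rows
windows = list(training_windows(rows, seq_len=252, max_horizon=21))
latest = inference_window(rows, seq_len=252)
print(len(windows), windows[0][0], windows[-1][0], latest[0], latest[-1])
```

Consecutive windows overlap in all but one row, which is how a fixed one-year lookback still yields many training samples from a long history.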
8. Label Formulation
Let the target ticker be j*. If the window ends at time t, the labels are:
y_t = [y_(t,1), y_(t,5), y_(t,21)]
where
y_(t,1) = log(C_(t+1)^(j*) / C_t^(j*))
y_(t,5) = log(C_(t+5)^(j*) / C_t^(j*))
y_(t,21) = log(C_(t+21)^(j*) / C_t^(j*))
So the model is learning a mapping
f_theta : R^(252 x 120) -> R^3
with outputs:
- next_day_return
- next_week_return
- one_month_return
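The label definition, together with the positive-close safeguard from 4.4, fits in a short function. `make_labels` is a hypothetical name, and returning None for unusable anchors is a choice made for this sketch.

```python
import math

HORIZONS = (1, 5, 21)

def make_labels(target_close, t):
    """Log-return labels for anchor index t, or None if the anchor is unusable."""
    last_close = target_close[t]
    if last_close <= 0:                          # positive-close safeguard
        return None
    if t + max(HORIZONS) >= len(target_close):   # future prices not available
        return None
    return [math.log(target_close[t + h] / last_close) for h in HORIZONS]

# Synthetic closes growing 1% per day.
closes = [100.0 * 1.01 ** i for i in range(30)]
labels = make_labels(closes, t=0)
print(labels)
```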
9. Neural Architecture
The implementation is a compact TFT-style hybrid. It contains:
- a feature gate over raw inputs
- a linear projection into model space
- a local LSTM
- learned positional embeddings
- stacked temporal self-attention blocks
- gated residual feed-forward blocks
- a final multi-output head
9.1 Feature gate
Let x_t in R^120. The first operation is an elementwise gate:
g_t = sigma(W_g x_t + b_g)
x'_t = x_t odot g_t
where sigma is the logistic sigmoid and odot is elementwise multiplication.
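This allows the model to softly suppress individual input dimensions before projection. Because each gate value lies in (0, 1), a dimension is either scaled down or passed through nearly unchanged; any amplification has to come from the projection that follows.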
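A toy version of the gate on a 3-dimensional input makes the behaviour concrete. The weights below are hand-picked for illustration and are not trained values; `feature_gate` is a hypothetical helper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feature_gate(x, W_g, b_g):
    """x' = x * sigmoid(W_g x + b_g), applied elementwise."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_g, b_g)]
    gate = [sigmoid(p) for p in pre]
    return [xi * gi for xi, gi in zip(x, gate)]

x = [1.0, -2.0, 0.5]
W_zero = [[0.0] * 3 for _ in range(3)]

half = feature_gate(x, W_zero, [0.0, 0.0, 0.0])         # every gate equals 0.5
selective = feature_gate(x, W_zero, [10.0, -10.0, 0.0])  # open, closed, half
print(half, selective)
```

A large positive pre-activation leaves the input almost untouched, while a large negative one drives it toward zero.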
9.2 Projection into the model dimension
e_t = W_p x'_t + b_p
with
e_t in R^128
Stacking this across time gives
E in R^(252 x 128)
9.3 Local recurrent encoder
The sequence is then processed by a one-layer LSTM:
H^(0) = LSTM(E)
This step is meant to capture local temporal structure before self-attention mixes distant timesteps.
9.4 Positional parameter
A learned positional tensor P in R^(252 x 128) is added:
H^(0) <- H^(0) + P
9.5 Temporal self-attention
For each attention layer ell, the model computes
A^(ell) = MHA(H^(ell-1), H^(ell-1), H^(ell-1))
because query, key, and value are all the same historical sequence.
Then it applies a residual connection and normalization:
Z^(ell) = LayerNorm(H^(ell-1) + Dropout(A^(ell)))
In the default config:
- embed_dim = 128
- num_heads = 4
- head width = 128 / 4 = 32
- number of attention layers = 2
Important implementation detail:
- there is no causal mask in the attention call
- there are no future rows in the input window
- therefore each historical day can attend to every other historical day inside the same 252-day lookback
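The effect of omitting the causal mask can be seen in a deliberately reduced sketch: a single head with identity query, key, and value maps on a 3-step sequence. The real layer is multi-head with learned projections, so this illustrates only the unmasked attention pattern.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(H):
    """Scaled dot-product self-attention over timesteps, with no mask."""
    d = len(H[0])
    weights = []
    for q in H:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H]
        weights.append(softmax(scores))
    out = [[sum(w * v[c] for w, v in zip(ws, H)) for c in range(d)]
           for ws in weights]
    return out, weights

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 timesteps, width 2
out, weights = self_attention(H)
print(weights[0])   # the first timestep puts weight on later timesteps too
```

A causal mask would force the entries above the diagonal of the weight matrix to zero; here every entry is positive.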
9.6 Gated residual network
After each attention block, the model applies a gated residual network:
u = W_2 phi(W_1 z + b_1) + b_2
gate(u) = sigma(W_3 u + b_3)
GRN(z) = LayerNorm(z + gate(u) odot u)
where phi is GELU by default.
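This is the nonlinear feed-forward stage used after each attention block. Applied at every timestep of Z^(ell), it yields the layer output H^(ell) = GRN(Z^(ell)), which feeds the next attention layer.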
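The three GRN equations can be traced on one small vector. This sketch assumes exact (erf-based) GELU and a LayerNorm without learned scale or shift; the identity weights are placeholders, not trained parameters, and the helper names are hypothetical.

```python
import math

def matvec(W, x, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def grn(z, W1, b1, W2, b2, W3, b3):
    hidden = [gelu(h) for h in matvec(W1, z, b1)]       # phi(W_1 z + b_1)
    u = matvec(W2, hidden, b2)                          # u = W_2 (...) + b_2
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(W3, u, b3)]
    return layer_norm([zi + gi * ui for zi, gi, ui in zip(z, gate, u)])

I3 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
zero = [0.0, 0.0, 0.0]
out = grn([1.0, 2.0, 4.0], I3, zero, I3, zero, I3, zero)
print(out)
```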
9.7 Last-timestep pooling and horizon head
After the final attention/GRN stack, only the representation of the last timestep is used:
h_* = H_final[252] in R^128
Then a final head maps R^128 -> R^3:
o = W_5 psi(W_4 h_* + b_4) + b_5
where psi is ReLU in the output head.
The three coordinates of o are interpreted as:
- o_1 = next_day_return
- o_2 = next_week_return
- o_3 = one_month_return
10. Training Objective
For one sample, the loss is the sum of three squared-error terms, one per horizon (each written below as an MSE over a single scalar):
L_t(theta) = MSE(o_1, y_(t,1)) + MSE(o_2, y_(t,5)) + MSE(o_3, y_(t,21))
Over a minibatch B, the optimization target is
L_B(theta) = (1 / |B|) sum_(t in B) L_t(theta)
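Written out for a tiny batch, the objective is the per-sample sum of squared errors over the three horizons, averaged over the batch. The numbers are made up and `batch_loss` is a hypothetical helper name.

```python
def batch_loss(preds, targets):
    """Mean over the batch of the summed squared errors across horizons."""
    per_sample = [
        sum((o - y) ** 2 for o, y in zip(pred, target))
        for pred, target in zip(preds, targets)
    ]
    return sum(per_sample) / len(per_sample)

preds = [[0.01, 0.02, 0.03], [0.00, 0.00, 0.00]]
targets = [[0.00, 0.02, 0.05], [0.01, 0.00, 0.00]]
loss = batch_loss(preds, targets)
print(loss)
```

The three horizons are weighted equally here, so the one-month term, whose returns are typically larger in magnitude, tends to dominate the sum.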
The training loop uses:
- AdamW
- cosine annealing learning rate schedule
- early stopping on validation loss
- mixed precision when CUDA is available
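Two of these ingredients are easy to show in isolation. The sketch uses the standard closed-form cosine annealing curve and a simple patience rule. The base learning rate, period, and patience are hypothetical values, and the early-stopping rule used in trainer.py may differ.

```python
import math

def cosine_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

def early_stop_epoch(val_losses, patience):
    """Epoch at which training stops, or None if patience is never exhausted."""
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0       # improvement resets the counter
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return None

schedule = [cosine_lr(e, base_lr=1e-3, t_max=50) for e in (0, 25, 50)]
stop = early_stop_epoch([1.0, 0.9, 0.95, 0.93, 0.92, 0.8], patience=3)
print(schedule, stop)
```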
11. Database-Grounded Example
The following examples were queried from the live PostgreSQL database on 2026-04-23.
11.1 Raw rows for one date
For 2026-04-23, selected assets look like this in market_data_daily:
| ticker | date | close | volume |
|---|---|---|---|
| AAPL | 2026-04-23 | 274.42 | 4096623 |
| QQQ | 2026-04-23 | 653.25 | 4420553 |
| SPY | 2026-04-23 | 710.42 | 5484657 |
| TLT | 2026-04-23 | 86.95 | 1342258 |
| VXX | 2026-04-23 | 29.63 | 682982 |
This is still raw market data, not the model row.
11.2 Derived features for the same date
After the rolling transforms used by dataset.py, the same date becomes:
| ticker | close_z | volume_z | ret1 | ret5 | vol20 | ma20_ratio |
|---|---|---|---|---|---|---|
| AAPL | 1.500540 | -2.095325 | 0.004565 | 0.040986 | 0.016267 | 0.053276 |
| QQQ | 2.240282 | -2.790993 | -0.002843 | 0.019758 | 0.014143 | 0.069346 |
| SPY | 1.693107 | -2.901096 | -0.001111 | 0.012407 | 0.011457 | 0.048372 |
| TLT | -0.538901 | -2.121076 | 0.002418 | 0.007735 | 0.005546 | 0.003393 |
| VXX | -0.402183 | -2.228161 | 0.002704 | 0.011200 | 0.042506 | -0.083868 |
If these five assets were the entire universe, the row would be:
x_t in R^(5 * 6) = R^30
But the actual default model uses 20 assets, so the real row width is:
x_t in R^120
11.3 Interpreting one actual row
For AAPL on 2026-04-23, the six numbers
[1.500540, -2.095325, 0.004565, 0.040986, 0.016267, 0.053276]
mean:
- price is about 1.50 rolling standard deviations above its 60-day mean
- volume is about 2.10 rolling standard deviations below its 60-day mean
- the one-day log return is about 0.4565%
- the five-day log return is about 4.0986%
- twenty-day realized daily volatility is about 1.6267%
- price is about 5.33% above its 20-day moving average
That 6-dimensional AAPL slice is only one contiguous block inside the full 120-dimensional cross-asset row.
12. Example Label Vector
Because labels need future prices, choose an anchor date that is at least 21 trading days before the database endpoint. For AAPL on 2026-03-23, the target labels are:
| anchor_date | anchor_close | next_day_return | next_week_return | one_month_return |
|---|---|---|---|---|
| 2026-03-23 | 251.49 | 0.000596 | -0.019514 | 0.082691 |
So the supervised pair is conceptually:
X_(2026-03-23) in R^(252 x 120)
y_(2026-03-23) = [0.000596, -0.019514, 0.082691]
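Since each label is a log return relative to the anchor close, exponentiating recovers the future closes the labels imply. The sketch below uses only the anchor close and label values from the table above; the variable names are illustrative.

```python
import math

anchor_close = 251.49
labels = {
    "next_day_return": 0.000596,
    "next_week_return": -0.019514,
    "one_month_return": 0.082691,
}

# C_(t+h) = C_t * exp(y_(t,h)) inverts y_(t,h) = log(C_(t+h) / C_t).
implied_close = {name: anchor_close * math.exp(y) for name, y in labels.items()}
for name, price in implied_close.items():
    print(f"{name}: {price:.2f}")
```

The one-month label corresponds to a close of roughly 273, which is a simple return of about 8.6%, slightly larger than the log return itself.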
13. What Is Actually Passed Into the Transformer
This is the key conceptual point:
- the database rows for one date (one per ticker) become one cross-asset feature row x_t in R^120
- one prediction sample is a stack of 252 such rows
- the attention module operates across the 252 timesteps, not across raw SQL rows directly
So the computational path is:
market_data_daily rows
-> aligned close/volume matrices
-> per-asset rolling features
-> one date-level row x_t in R^120
-> one window X_t in R^(252 x 120)
-> feature gate
-> projection to R^128
-> LSTM over 252 timesteps
-> self-attention over 252 timesteps
-> GRN blocks
-> final timestep representation
-> 3 regression outputs
14. Why the Window Size Matters
The model sees only the last 252 trading days for any single prediction. It does not directly attend beyond that horizon during one forward pass.
However, the model is still trained over many years of history because the dataset creates many overlapping windows:
X_(t1), X_(t2), X_(t3), ...
Each training sample has a one-year lookback, but the collection of samples spans the full history in the database.
15. Practical Interpretation
This architecture is best viewed as:
- cross-sectional fusion across assets inside each row
- temporal modeling across 252 days inside the sequence
- multi-horizon regression at the output
The LSTM handles local sequential structure. The attention layers let the final representation compare distant days within the lookback window. The gating stages act as soft feature selection.
16. Reproducibility Queries
Example raw data query:
SELECT ticker, date, close, volume
FROM market_data_daily
WHERE ticker IN ('AAPL', 'SPY', 'QQQ', 'TLT', 'VXX')
  AND date = DATE '2026-04-23'
ORDER BY ticker;
Example latest 252-day window span:
WITH d AS (
    SELECT date,
           ROW_NUMBER() OVER (ORDER BY date DESC) AS rn
    FROM market_data_daily
    WHERE ticker = 'AAPL'
)
SELECT MIN(date) AS window_start,
       MAX(date) AS window_end,
       COUNT(*)  AS trading_days
FROM d
WHERE rn <= 252;
17. Important Note
As of 2026-04-23, the checked-in model/tft/model.py file has a formatting corruption near the top of the file. The formulation above is based on the readable model body plus the training, dataset, config, export, and inference paths, which are consistent about the intended architecture.