Cross-Asset TFT Model: Mathematical Walkthrough and Data Flow

Date: 2026-04-23

This note explains how the repository's cross-asset Temporal Fusion Transformer-style model is formulated, how the data is sanitized, what one row looks like mathematically, how rows are assembled into a window, and how that window is processed by the network.

The implementation files referenced in this note are model/tft/dataset.py, model/tft/config.py, and model/tft/model.py.

1. Model Purpose

The model forecasts future log returns for one target ticker using both the target's own history and a cross-asset context basket. With the default configuration, the target is combined with 19 context ETFs/proxies:

Equities: SPY, QQQ, IWM, DIA
Rates: TLT, IEF, SHY
Credit: HYG, LQD
Commodities: GLD, SLV, USO, UNG
FX and dollar proxies: UUP, FXE, FXY, FXB, FXA
Volatility: VXX

The model predicts three horizons:

next day: 1 trading day ahead
next week: 5 trading days ahead
one month: 21 trading days ahead

This is a multi-horizon regression model, not a classifier.

2. Tensor Dimensions

Let:

A = number of assets in the input basket
F = number of engineered features per asset
L = sequence length
d = transformer hidden width

In the default config:

A = 20
F = 6
L = 252
d = 128

So the per-timestep feature width is

D = A * F = 20 * 6 = 120

and one model input sample has shape

X in R^(252 x 120)

At inference time the batch dimension is added:

X_batch in R^(1 x 252 x 120)
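
The shape bookkeeping above can be checked with a few lines of NumPy. This is only a sketch: the zero-filled array stands in for real feature rows, and the variable names are illustrative rather than taken from the repository.

```python
import numpy as np

# Default values quoted in this section.
A, F, L, d = 20, 6, 252, 128
D = A * F                                  # per-timestep feature width

X = np.zeros((L, D), dtype=np.float32)     # placeholder for one sample window
X_batch = X[np.newaxis, ...]               # leading batch dimension for inference

print(D, X.shape, X_batch.shape)
```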

3. Raw Database Source

The input starts from market_data_daily, where the dataset loader reads only:

ticker
date
close
volume

For a target ticker j*, the code:

loads all chosen tickers
pivots close and volume into date x ticker matrices
keeps only dates where the target exists
forward-fills context tickers across the target's calendar
computes rolling features per ticker
drops rows with unresolved NaN

This behavior comes directly from model/tft/dataset.py.

4. Data Sanitization and Alignment

For each ticker j and date t, let:

C_t^(j) = close
V_t^(j) = volume

The loader sanitizes data in four important ways:

4.1 Calendar alignment

The target ticker defines the date index. Context assets are reindexed onto those target dates:

I_target = { t : C_t^(target) is observed }

Every other ticker is projected onto I_target.

4.2 Forward fill of context assets

If a context asset is missing on a target trading date, the last available observation is carried forward:

C_t^(j) <- C_(t')^(j) where t' < t is the most recent available date

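This is a practical way to tolerate ETF holidays and imperfect calendar overlap. Forward fill cannot supply values before a context asset's first observation (for example, a delayed listing), so those leading gaps stay NaN until the cleanup in 4.3 removes the affected rows.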

4.3 Numerical cleanup

After feature engineering, the pipeline replaces +inf and -inf with NaN, then applies dropna().

So any row that still has unresolved rolling-window warmup gaps is removed before training or inference.
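
Steps 4.1 to 4.3 can be sketched with pandas on a toy table. The rows, dates, and variable names below are hypothetical, and the cleanup is applied directly to closes for brevity, whereas the pipeline described here applies it after feature engineering. Volume would be pivoted and aligned the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format rows with the columns read from market_data_daily.
rows = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "AAPL", "SPY", "SPY", "TLT", "TLT"],
    "date": pd.to_datetime(["2026-01-02", "2026-01-05", "2026-01-06",
                            "2026-01-02", "2026-01-06",
                            "2026-01-03", "2026-01-05"]),
    "close": [100.0, 101.0, 102.0, 500.0, 505.0, 89.0, 90.0],
    "volume": [10, 11, 12, 50, 52, 8, 9],
})
target = "AAPL"

# Pivot close into a date x ticker matrix.
close = rows.pivot(index="date", columns="ticker", values="close")

# 4.1: the target's observed dates define the calendar.
target_dates = close[target].dropna().index
close = close.reindex(target_dates)

# 4.2: carry the last available context observation forward.
close = close.ffill()

# 4.3: infinities become NaN, then rows with any NaN are dropped.
clean = close.replace([np.inf, -np.inf], np.nan).dropna()
print(clean)
```

In this toy case the first target date is dropped because TLT has no earlier observation on the target calendar to carry forward, and the TLT row dated 2026-01-03 never enters the matrix because the target has no row on that date.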

4.4 Positive-close safeguard

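Any training anchor with last_close <= 0 is skipped before labels are created, because the labels are logarithms of price ratios relative to last_close and are undefined for a non-positive close.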

5. Feature Engineering Per Asset

For each asset j, the model constructs six features at each date t.

5.1 Price z-score over a 60-day rolling window

Let

mu_C,t^(j) = mean(C_(t-59:t)^(j))

sigma_C,t^(j) = std(C_(t-59:t)^(j))

Then

close_z_t^(j) = (C_t^(j) - mu_C,t^(j)) / (sigma_C,t^(j) + eps)

with eps = 1e-8.

5.2 Volume z-score over a 60-day rolling window

mu_V,t^(j) = mean(V_(t-59:t)^(j))

sigma_V,t^(j) = std(V_(t-59:t)^(j))

volume_z_t^(j) = (V_t^(j) - mu_V,t^(j)) / (sigma_V,t^(j) + eps)

5.3 One-day log return

ret1_t^(j) = log(C_t^(j) / C_(t-1)^(j))

5.4 Five-day log return

ret5_t^(j) = log(C_t^(j) / C_(t-5)^(j))

5.5 Twenty-day realized volatility of daily log returns

vol20_t^(j) = std(ret1_(t-19:t)^(j))

5.6 Distance from the 20-day moving average

Let

ma20_t^(j) = mean(C_(t-19:t)^(j))

Then

ma20_ratio_t^(j) = C_t^(j) / (ma20_t^(j) + eps) - 1

These are exactly the features named in model/tft/config.py.
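
A minimal pandas sketch of the six formulas follows. The function name asset_features and the synthetic price and volume series are illustrative. It relies on pandas' default rolling standard deviation (sample standard deviation, ddof=1), which is an assumption: this note does not say which convention dataset.py uses.

```python
import numpy as np
import pandas as pd

EPS = 1e-8

def asset_features(close: pd.Series, volume: pd.Series) -> pd.DataFrame:
    """Six per-asset features, in the order listed in section 5."""
    ret1 = np.log(close / close.shift(1))
    ma20 = close.rolling(20).mean()
    feats = pd.DataFrame({
        "close_z": (close - close.rolling(60).mean()) / (close.rolling(60).std() + EPS),
        "volume_z": (volume - volume.rolling(60).mean()) / (volume.rolling(60).std() + EPS),
        "ret1": ret1,
        "ret5": np.log(close / close.shift(5)),
        "vol20": ret1.rolling(20).std(),
        "ma20_ratio": close / (ma20 + EPS) - 1.0,
    })
    # Same cleanup as section 4.3: drop warmup rows and any infinities.
    return feats.replace([np.inf, -np.inf], np.nan).dropna()

# Synthetic series standing in for one ticker's aligned close and volume.
rng = np.random.default_rng(0)
n = 120
close = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, n))))
volume = pd.Series(rng.integers(1_000_000, 5_000_000, n).astype(float))

feats = asset_features(close, volume)
print(feats.shape)
```

The 60-day z-scores have the longest warmup, so the first 59 rows are dropped and 61 of the 120 synthetic rows survive.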

6. One Row: Cross-Asset Feature Concatenation

For a given date t, each asset contributes a 6-dimensional feature vector

f_t^(j) in R^6

The full cross-asset row is the concatenation

x_t = [f_t^(j1) || f_t^(j2) || ... || f_t^(jA)] in R^(120)

In this repository, the ordering is:

target ticker first
context tickers in the configured order
within each ticker, feature order is close_z, volume_z, ret1, ret5, vol20, ma20_ratio

So a single date is not sent into the transformer as a scalar or a price. It is sent as a 120-dimensional vector that already fuses target and context information.
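
The ordering rule can be illustrated with a three-asset toy basket that uses the AAPL, SPY, and QQQ values from the table in section 11.2. The helper build_row is hypothetical and is not the repository's function.

```python
import numpy as np

FEATURES = ["close_z", "volume_z", "ret1", "ret5", "vol20", "ma20_ratio"]

def build_row(per_asset, target, context):
    """Concatenate per-asset feature vectors: target first, then context in order."""
    order = [target] + list(context)
    return np.concatenate([[per_asset[t][f] for f in FEATURES] for t in order])

# Feature values for 2026-04-23, copied from section 11.2.
per_asset = {
    "AAPL": dict(zip(FEATURES, [1.500540, -2.095325, 0.004565, 0.040986, 0.016267, 0.053276])),
    "SPY": dict(zip(FEATURES, [1.693107, -2.901096, -0.001111, 0.012407, 0.011457, 0.048372])),
    "QQQ": dict(zip(FEATURES, [2.240282, -2.790993, -0.002843, 0.019758, 0.014143, 0.069346])),
}

row = build_row(per_asset, target="AAPL", context=["SPY", "QQQ"])
print(row.shape)
```

With the full default basket of 20 tickers the same concatenation yields the 120-dimensional row; asset k (0-based, target at k = 0) occupies positions 6k to 6k + 5.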

7. From Rows to a Training or Inference Window

The network never consumes one isolated row by itself. It consumes a length-252 sequence of rows:

X_t = [x_(t-251), x_(t-250), ..., x_t] in R^(252 x 120)

For training, each anchor date t produces one sample window and one 3-dimensional label vector.

For inference, the latest valid 252 rows are used.
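
The windowing step can be sketched as follows. The helper names are illustrative, and trimming the last 21 anchors so that every training window has a one-month label is an assumption about how the dataset builder behaves, not a confirmed detail.

```python
import numpy as np

def make_windows(rows, seq_len=252, max_horizon=21):
    """Stack overlapping windows whose anchors leave room for the longest label."""
    n = rows.shape[0]
    # Anchor t is the last row of its window and needs row t + max_horizon to exist.
    anchors = np.arange(seq_len - 1, n - max_horizon)
    X = np.stack([rows[t - seq_len + 1 : t + 1] for t in anchors])
    return X, anchors

def latest_window(rows, seq_len=252):
    """Inference input: the most recent seq_len feature rows."""
    return rows[-seq_len:]

# Toy feature matrix: 300 dates x 120 features, filled with a running counter.
rows = np.arange(300 * 120, dtype=np.float32).reshape(300, 120)
X, anchors = make_windows(rows)
X_latest = latest_window(rows)
print(X.shape, X_latest.shape)
```

With 300 feature rows this gives 28 training windows (anchors 251 through 278), while the inference window always ends at the final row.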

For AAPL in the current database, the latest inference window spans:

window_start = 2025-04-23
window_end = 2026-04-23
trading_days = 252

8. Label Formulation

Let the target ticker be j*. If the window ends at time t, the labels are:

y_t = [y_(t,1), y_(t,5), y_(t,21)]

where

y_(t,1) = log(C_(t+1)^(j*) / C_t^(j*))

y_(t,5) = log(C_(t+5)^(j*) / C_t^(j*))

y_(t,21) = log(C_(t+21)^(j*) / C_t^(j*))

So the model is learning a mapping

f_theta : R^(252 x 120) -> R^3

with outputs:

next_day_return
next_week_return
one_month_return
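
A small sketch of the label computation, including the positive-close safeguard from section 4.4. The function name make_labels and the synthetic price path are illustrative.

```python
import numpy as np

HORIZONS = (1, 5, 21)   # trading days ahead: next day, next week, one month

def make_labels(target_close, t):
    """Return [y_(t,1), y_(t,5), y_(t,21)], or None if the anchor is unusable."""
    last_close = target_close[t]
    if last_close <= 0:                          # section 4.4 safeguard
        return None
    if t + max(HORIZONS) >= len(target_close):   # future price not yet available
        return None
    return np.array([np.log(target_close[t + h] / last_close) for h in HORIZONS])

# Synthetic path that gains exactly 1% in log terms every day.
target_close = 100.0 * np.exp(0.01 * np.arange(30))
y = make_labels(target_close, t=5)
print(y)
```

On this path the h-day label is 0.01 * h, so the three labels are close to 0.01, 0.05, and 0.21.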

9. Neural Architecture

The implementation is a compact TFT-style hybrid. It contains:

a feature gate over raw inputs
a linear projection into model space
a local LSTM
learned positional embeddings
stacked temporal self-attention blocks
gated residual feed-forward blocks
a final multi-output head

9.1 Feature gate

Let x_t in R^120. The first operation is an elementwise gate:

g_t = sigma(W_g x_t + b_g)

x'_t = x_t odot g_t

where sigma is the logistic sigmoid and odot is elementwise multiplication.

Because every gate value lies in (0, 1), this lets the model softly suppress individual input dimensions, or pass them through nearly unchanged, before projection.

9.2 Projection into the model dimension

e_t = W_p x'_t + b_p

with

e_t in R^128

Stacking this across time gives

E in R^(252 x 128)
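
The gate (9.1) and projection (9.2) can be sketched in NumPy with random stand-ins for the learned parameters. The square shape of W_g (120 x 120) is an assumption; it is what the elementwise product with x_t requires.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, d, L = 120, 128, 252

# Random stand-ins for the learned parameters (not trained values).
W_g, b_g = rng.normal(0.0, 0.05, (D, D)), np.zeros(D)
W_p, b_p = rng.normal(0.0, 0.05, (d, D)), np.zeros(d)

X = rng.normal(size=(L, D))        # one window of rows x_t
G = sigmoid(X @ W_g.T + b_g)       # g_t for every timestep, each entry in (0, 1)
X_gated = X * G                    # x'_t = x_t odot g_t
E = X_gated @ W_p.T + b_p          # e_t stacked over time

print(G.shape, E.shape)
```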

9.3 Local recurrent encoder

The sequence is then processed by a one-layer LSTM:

H^(0) = LSTM(E)

This step is meant to capture local temporal structure before self-attention mixes distant timesteps.

9.4 Positional parameter

A learned positional tensor P in R^(252 x 128) is added:

H^(0) <- H^(0) + P
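
Steps 9.3 and 9.4 can be sketched with a hand-written single-layer LSTM in NumPy. All weights are random stand-ins, and the LSTM hidden size is assumed to equal d = 128, which is what adding P in R^(252 x 128) requires. A real implementation would normally use a framework LSTM layer instead of this loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(E, W_x, W_h, b):
    """Single-layer LSTM over a (L, d) sequence; returns all hidden states (L, d)."""
    L, d = E.shape
    h = np.zeros(d)
    c = np.zeros(d)
    out = np.empty((L, d))
    for t in range(L):
        z = W_x @ E[t] + W_h @ h + b       # stacked pre-activations, shape (4d,)
        i, f, g, o = np.split(z, 4)        # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

rng = np.random.default_rng(0)
L, d = 252, 128
E = rng.normal(size=(L, d))                # stand-in for the projected sequence
W_x = rng.normal(0.0, 0.05, (4 * d, d))
W_h = rng.normal(0.0, 0.05, (4 * d, d))
b = np.zeros(4 * d)
P = rng.normal(0.0, 0.02, (L, d))          # stand-in for the learned positional tensor

H_lstm = lstm_forward(E, W_x, W_h, b)
H0 = H_lstm + P                            # H^(0) <- H^(0) + P
print(H0.shape)
```

The recurrence is strictly left to right, so the hidden state at one timestep depends only on that timestep and earlier ones.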

9.5 Temporal self-attention

For each attention layer ell, the model computes

A^(ell) = MHA(H^(ell-1), H^(ell-1), H^(ell-1))

because query, key, and value are all the same historical sequence.

Then it applies a residual connection and normalization:

Z^(ell) = LayerNorm(H^(ell-1) + Dropout(A^(ell)))

In the default config:

embed_dim = 128
num_heads = 4
head width = 128 / 4 = 32
number of attention layers = 2

Important implementation detail:

there is no causal mask in the attention call
there are no future rows in the input window
therefore each historical day can attend to every other historical day inside the same 252-day lookback
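
One unmasked attention block can be sketched in NumPy as follows. Simplifications that are assumptions of this sketch: projection biases and dropout are omitted, LayerNorm has no learned scale or shift, and all weights are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_block(H, W_q, W_k, W_v, W_o, num_heads=4):
    """Unmasked multi-head self-attention followed by residual + LayerNorm."""
    L, d = H.shape
    hd = d // num_heads                        # head width, 128 / 4 = 32

    def split(M):                              # (L, d) -> (heads, L, hd)
        return M.reshape(L, num_heads, hd).transpose(1, 0, 2)

    Q, K, V = split(H @ W_q), split(H @ W_k), split(H @ W_v)
    # (heads, L, L) attention weights; no causal mask is applied.
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(hd))
    A = (weights @ V).transpose(1, 0, 2).reshape(L, d) @ W_o
    return layer_norm(H + A), weights

rng = np.random.default_rng(0)
L, d = 252, 128
H = rng.normal(size=(L, d))
W_q, W_k, W_v, W_o = (rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(4))

Z, weights = self_attention_block(H, W_q, W_k, W_v, W_o)
print(Z.shape, weights.shape)
```

Because there is no mask, every entry of the 252 x 252 weight matrix is positive, including the weight that the first day of the window places on the last day.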

9.6 Gated residual network

After each attention block, the model applies a gated residual network:

u = W_2 phi(W_1 z + b_1) + b_2

gate(u) = sigma(W_3 u + b_3)

GRN(z) = LayerNorm(z + gate(u) odot u)

where phi is GELU by default.

This is the nonlinear feed-forward stage used after each attention block.
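
The three GRN equations translate directly into NumPy. The inner width (taken here to be 128, the same as d) and the random weights are assumptions of this sketch, and LayerNorm is again shown without learned scale or shift.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def grn(Z, W1, b1, W2, b2, W3, b3):
    U = gelu(Z @ W1 + b1) @ W2 + b2        # u = W_2 phi(W_1 z + b_1) + b_2
    gate = sigmoid(U @ W3 + b3)            # gate(u) = sigma(W_3 u + b_3)
    return layer_norm(Z + gate * U)        # LayerNorm(z + gate(u) odot u)

rng = np.random.default_rng(0)
L, d = 252, 128
Z = rng.normal(size=(L, d))
W1, W2, W3 = (rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3))
b1 = b2 = b3 = np.zeros(d)

out = grn(Z, W1, b1, W2, b2, W3, b3)
print(out.shape)
```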

9.7 Last-timestep pooling and horizon head

After the final attention/GRN stack, only the representation of the last timestep is used:

h_* = H_final[252] in R^128 (1-based index: the row for the window's end date t)

Then a final head maps R^128 -> R^3:

o = W_5 psi(W_4 h_* + b_4) + b_5

where psi is ReLU in the output head.

The three coordinates of o are interpreted as:

o_1 = next_day_return
o_2 = next_week_return
o_3 = one_month_return
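
A sketch of the pooling and head. The hidden width of the head (64 here) is a hypothetical value and the weights are random stand-ins; only the R^128 -> R^3 mapping with a ReLU in between comes from the description above.

```python
import numpy as np

def horizon_head(H_final, W4, b4, W5, b5):
    """Map the last timestep's representation to the three horizon outputs."""
    h_star = H_final[-1]                          # last timestep (row 252 in 1-based terms)
    hidden = np.maximum(W4 @ h_star + b4, 0.0)    # psi = ReLU
    return W5 @ hidden + b5

rng = np.random.default_rng(0)
L, d, hidden_width = 252, 128, 64                 # hidden_width is hypothetical
H_final = rng.normal(size=(L, d))
W4, b4 = rng.normal(0.0, 0.05, (hidden_width, d)), np.zeros(hidden_width)
W5, b5 = rng.normal(0.0, 0.05, (3, hidden_width)), np.zeros(3)

o = horizon_head(H_final, W4, b4, W5, b5)
names = ("next_day_return", "next_week_return", "one_month_return")
print(dict(zip(names, o)))
```

Earlier timesteps influence the output only through the LSTM and attention layers that produced the last row; the head itself never reads them.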

10. Training Objective

For one sample, the loss is the sum of three squared errors, one per horizon (the single-sample case of mean-squared error):

L_t(theta) = (o_1 - y_(t,1))^2 + (o_2 - y_(t,5))^2 + (o_3 - y_(t,21))^2

Over a minibatch B, the optimization target is

L_B(theta) = (1 / |B|) sum_(t in B) L_t(theta)

The training loop uses:

AdamW
cosine annealing learning rate schedule
early stopping on validation loss
mixed precision when CUDA is available
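
The batch objective and a cosine-annealed learning rate can be sketched without any framework. The learning-rate values and step counts are hypothetical, and the schedule shown is the standard closed-form cosine annealing formula, not a confirmed copy of the repository's scheduler settings.

```python
import math
import numpy as np

def batch_loss(O, Y):
    """Sum over the three horizons of the per-horizon mean-squared error.

    O and Y have shape (batch, 3). This equals (1 / |B|) * sum_t L_t(theta).
    """
    return float(((O - Y) ** 2).mean(axis=0).sum())

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Closed-form cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))

# Toy batch of two samples.
O = np.array([[0.01, 0.02, 0.03], [0.0, 0.0, 0.0]])
Y = np.zeros((2, 3))
loss = batch_loss(O, Y)

# Hypothetical schedule: 100 steps starting from 1e-3.
lrs = [cosine_lr(s, 100, 1e-3) for s in (0, 50, 100)]
print(loss, lrs)
```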

11. Database-Grounded Example

The following examples were queried from the live PostgreSQL database on 2026-04-23.

11.1 Raw rows for one date

For 2026-04-23, selected assets look like this in market_data_daily:

ticker	date	close	volume
AAPL	2026-04-23	274.42	4096623
QQQ	2026-04-23	653.25	4420553
SPY	2026-04-23	710.42	5484657
TLT	2026-04-23	86.95	1342258
VXX	2026-04-23	29.63	682982

This is still raw market data, not the model row.

11.2 Derived features for the same date

After the rolling transforms used by dataset.py, the same date becomes:

ticker	close_z	volume_z	ret1	ret5	vol20	ma20_ratio
AAPL	1.500540	-2.095325	0.004565	0.040986	0.016267	0.053276
QQQ	2.240282	-2.790993	-0.002843	0.019758	0.014143	0.069346
SPY	1.693107	-2.901096	-0.001111	0.012407	0.011457	0.048372
TLT	-0.538901	-2.121076	0.002418	0.007735	0.005546	0.003393
VXX	-0.402183	-2.228161	0.002704	0.011200	0.042506	-0.083868

If these five assets were the entire universe, the row would be:

x_t in R^(5 * 6) = R^30

But the actual default model uses 20 assets, so the real row width is:

x_t in R^120

11.3 Interpreting one actual row

For AAPL on 2026-04-23, the six numbers

[1.500540, -2.095325, 0.004565, 0.040986, 0.016267, 0.053276]

mean:

price is about 1.50 rolling standard deviations above its 60-day mean
volume is about 2.10 rolling standard deviations below its 60-day mean
the one-day log return is about 0.4565%
the five-day log return is about 4.0986%
twenty-day realized daily volatility is about 1.6267%
price is about 5.33% above its 20-day moving average

That 6-dimensional AAPL slice is only one contiguous block inside the full 120-dimensional cross-asset row.
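
Two of these readings can be converted into more familiar quantities with the standard library. The sketch below uses only the numbers shown above. The conversion exp(r) - 1 turns a log return into a simple return, and the implied 20-day moving average follows from inverting the ma20_ratio definition (ignoring eps).

```python
import math

# AAPL features and close for 2026-04-23, copied from sections 11.1 and 11.2.
aapl_close = 274.42
aapl = {"ret1": 0.004565, "ret5": 0.040986, "ma20_ratio": 0.053276}

simple_ret1 = math.exp(aapl["ret1"]) - 1.0       # simple one-day return
simple_ret5 = math.exp(aapl["ret5"]) - 1.0       # simple five-day return
implied_ma20 = aapl_close / (1.0 + aapl["ma20_ratio"])

print(f"{simple_ret1:.4%} {simple_ret5:.4%} {implied_ma20:.2f}")
```

For returns this small the log and simple returns differ only slightly, which is why quoting the log return as a percentage is a reasonable approximation.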

12. Example Label Vector

Because labels need future prices, choose an anchor date that is at least 21 trading days before the database endpoint. For AAPL on 2026-03-23, the target labels are:

anchor_date	anchor_close	next_day_return	next_week_return	one_month_return
2026-03-23	251.49	0.000596	-0.019514	0.082691

So the supervised pair is conceptually:

X_(2026-03-23) in R^(252 x 120)

y_(2026-03-23) = [0.000596, -0.019514, 0.082691]
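
Because each label is a log price ratio, it can be inverted to recover the future close it implies. The sketch below uses only the anchor close and labels from the table above; the variable names are illustrative.

```python
import math

anchor_close = 251.49
labels = {
    "next_day_return": 0.000596,
    "next_week_return": -0.019514,
    "one_month_return": 0.082691,
}

# y = log(C_future / C_anchor)  =>  C_future = C_anchor * exp(y)
implied_close = {name: anchor_close * math.exp(y) for name, y in labels.items()}

for name, value in implied_close.items():
    print(f"{name}: {value:.2f}")
```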

13. What Is Actually Passed Into the Transformer

This is the key conceptual point:

one date t becomes one cross-asset feature row x_t in R^120, computed from the market_data_daily rows of all 20 tickers over the trailing rolling windows
one prediction sample is a stack of 252 such rows
the attention module operates across the 252 timesteps, not across raw SQL rows directly

So the computational path is:

market_data_daily rows

-> aligned close/volume matrices

-> per-asset rolling features

-> one date-level row x_t in R^120

-> one window X_t in R^(252 x 120)

-> feature gate

-> projection to R^128

-> LSTM over 252 timesteps

-> self-attention over 252 timesteps

-> GRN blocks

-> final timestep representation

-> 3 regression outputs

14. Why the Window Size Matters

The model sees only the last 252 trading days for any single prediction. It does not directly attend beyond that lookback window during one forward pass.

However, the model is still trained over many years of history because the dataset creates many overlapping windows:

X_(t1), X_(t2), X_(t3), ...

Each training sample has a one-year lookback, but the collection of samples spans the full history in the database.

15. Practical Interpretation

This architecture is best viewed as:

cross-sectional fusion across assets inside each row
temporal modeling across 252 days inside the sequence
multi-horizon regression at the output

The LSTM handles local sequential structure. The attention layers let the final representation compare distant days within the lookback window. The gating stages act as soft feature selection.

16. Reproducibility Queries

Example raw data query:

SELECT ticker, date, close, volume
FROM market_data_daily
WHERE ticker IN ('AAPL','SPY','QQQ','TLT','VXX')
  AND date = DATE '2026-04-23'
ORDER BY ticker;

Example latest 252-day window span:

WITH d AS (
  SELECT date, ROW_NUMBER() OVER (ORDER BY date DESC) AS rn
  FROM market_data_daily
  WHERE ticker = 'AAPL'
)
SELECT MIN(date) AS window_start,
       MAX(date) AS window_end,
       COUNT(*)  AS trading_days
FROM d
WHERE rn <= 252;

17. Important Note

As of 2026-04-23, the checked-in model/tft/model.py file has a formatting corruption near the top of the file. The formulation above is based on the readable model body plus the training, dataset, config, export, and inference paths, which are consistent about the intended architecture.