The Core Question
When you have 20 years of daily price data, news sentiment, and hourly options chains, the central problem isn't collecting the data. It's teaching a model to find non-random relationships between the past and the future — relationships that are stable enough to trade.
Two architectural families dominate modern sequence modeling: transformers and diffusion models. Both can be applied to time series. They are not the same thing, and confusing them leads to architectural mistakes that are expensive to debug later.
Transformers: Direct Sequence-to-Prediction
A transformer is a neural network designed to model relationships between elements in a sequence. The key mechanism is attention — the model learns which past time steps are relevant to predicting the current or future time step.
The standard transformer prediction path is direct: window of past observations → attention layers → prediction.
This is a one-shot prediction. You pass in the sequence and get a prediction back in a single forward pass. The model doesn't refine its answer; it commits immediately.
What attention actually computes
The attention mechanism computes a weighted sum of values from the positions in the sequence. The weights are not fixed: they are computed for each input from learned query and key projections. For each position in the sequence, it asks: "how much should I attend to each other position?" In a 60-day window, a high attention weight on day 45 means "the features from 15 days ago matter a lot for this prediction." A low weight means "ignore it."
This is powerful for markets because relevant patterns are not always recent. A 200-day moving average cross-over from 60 days ago might matter more than yesterday's price. Attention can learn this.
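The weighted sum described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the model used in this series: the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins for trained weights, the embedding size of 16 is arbitrary, and no causal mask is applied.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for one attention head.

    q, k, v have shape (seq_len, d); weights has shape (seq_len, seq_len).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # similarity of every position to every other
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 60, 16                  # 60-day window, hypothetical embedding size
x = rng.normal(size=(seq_len, d_model))    # stand-in for embedded daily features
# Random stand-ins for the learned query/key/value projections.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
last_day_weights = attn[-1]                # how much day 60 attends to each day in the window
```

Plotting `last_day_weights` for a trained model is one way to check whether it is really looking at older history or only at the last few days.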
Diffusion Models: Prediction Through Iterative Denoising
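Diffusion models come from generative AI: image synthesis, audio generation. The core idea differs from the direct path in a fundamental way: rather than mapping the input window straight to an answer, the model starts from random noise and refines it toward a prediction over many small denoising steps.
The prediction path for a diffusion model looks like this: random noise → N denoising steps, each conditioned on the past window → prediction.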
Why diffusion is computationally heavier
A standard transformer does one forward pass per prediction. A diffusion model does N forward passes (one per denoising step). For financial time series, N is typically 50–1000. This is consistent with what we observed in this project: the TFT/diffusion-style model ran at 70% GPU utilization on a 128GB GB10 system for many hours, while a standard transformer finished in minutes.
More compute is not always better. The question is whether the multi-step refinement produces meaningfully better predictions, and whether your compute budget allows it.
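The cost difference is easiest to see by counting forward passes. The sketch below is a toy under loud assumptions: `toy_network` is a hypothetical stand-in for a trained model (it simply "knows" the window mean), and the update rule is a simplified pull toward the answer, not a faithful diffusion sampler.

```python
import numpy as np

calls = {"n": 0}

def toy_network(history, noisy_guess=None):
    """Hypothetical stand-in for a trained model; counts forward passes."""
    calls["n"] += 1
    target = history.mean()              # pretend the "learned" answer is the window mean
    if noisy_guess is None:
        return target                    # direct prediction
    return noisy_guess - target          # diffusion style: estimate of the remaining noise

def predict_direct(history):
    return toy_network(history)          # exactly one forward pass

def predict_by_denoising(history, n_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    guess = rng.normal()                 # start from pure noise
    for step in range(n_steps):
        est_noise = toy_network(history, guess)          # one forward pass per step
        guess = guess - est_noise / (n_steps - step)     # remove a fraction of the estimated noise
    return guess

history = np.linspace(-0.01, 0.01, 60)   # synthetic 60-day window of returns

calls["n"] = 0
direct = predict_direct(history)
direct_calls = calls["n"]

calls["n"] = 0
refined = predict_by_denoising(history, n_steps=50)
diffusion_calls = calls["n"]
```

Both paths land on the same answer here because the toy network is perfect. The point is the call count: 1 versus `n_steps`. A real sampler also injects fresh noise along the way, which is where the distribution of outcomes comes from; this toy omits that.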
Side-by-Side Comparison
| Dimension | Standard Transformer (ST) | Diffusion-based Model | TFT |
|---|---|---|---|
| Prediction type | Direct, one-shot | Iterative denoising | Direct, multi-horizon |
| Compute per prediction | Low (1 forward pass) | High (N forward passes) | Medium |
| Temporal memory | Attention only | Attention + denoising | LSTM + attention |
| Feature selection | Implicit via attention | Implicit via attention | Explicit gating layer |
| Uncertainty modeling | No (point estimate) | Yes (stochastic) | Optional (quantile heads) |
| Good for finance | Medium — prone to collapse | Experimental | Yes — robust to noise |
The Problem with "Data is Just Data"
At one point in building this system, the question came up: "if I'm just passing normalized numbers, does the architecture even matter? It's all just data."
This is a reasonable intuition that turns out to be wrong in an important way. The architecture is not just a container for data — it defines the inductive biases of what the model can and cannot learn efficiently.
What a standard transformer can't easily learn
- Temporal persistence. If volatility was high yesterday and the day before, it will probably be elevated tomorrow. A pure attention model doesn't have a built-in mechanism to carry this "state" through time — it must learn it implicitly from weights, which is harder.
- Context-dependent feature relevance. IV/HV ratio matters enormously during earnings. RSI is nearly useless during regime transitions. A standard transformer treats features the same way across all contexts unless explicitly taught otherwise.
- Avoiding collapse. Without mechanisms that force the model to produce diverse outputs, transformers on noisy financial data tend to collapse toward predicting the mean of the training distribution. This is exactly what happened in our experiments.
What the TFT adds
- Gated variable selection — learns which features matter at each time step
- LSTM encoder — explicitly models temporal state and persistence
- Gated residuals — stabilizes deep networks on noisy data
- Multi-horizon output heads — predicts 5-day, 10-day, 20-day returns simultaneously
None of these are magic. Each is an engineering decision that addresses a specific failure mode of naive transformers on financial data. We'll examine each in detail in Part 5.
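As a preview of the first item, here is a heavily simplified feature gate. It is an assumption-laden sketch, not the TFT's variable selection network: the real one runs each variable through its own gated residual network, while this version uses a single linear layer with random, untrained weights `w` and `b`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_variables(x, w, b):
    """Weight each feature at each time step by a softmax gate.

    x: (seq_len, n_features) inputs; w: (n_features, n_features); b: (n_features,)
    """
    gate = softmax(x @ w + b)     # (seq_len, n_features), each row sums to 1
    return gate * x, gate         # gated features and the weights themselves

rng = np.random.default_rng(0)
seq_len, n_features = 60, 5                             # hypothetical sizes
x = rng.normal(size=(seq_len, n_features))              # stand-in for normalized features
w = rng.normal(size=(n_features, n_features)) * 0.1     # untrained stand-in weights
b = np.zeros(n_features)

selected, gate = select_variables(x, w, b)
```

Because the gate depends on `x`, the weight given to a feature can differ from one day to the next, which is the "context-dependent relevance" a plain transformer lacks.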
The Sliding Window: How Sequences Become Training Data
Regardless of architecture, all sequence models need their data formatted as windows. This is worth understanding precisely because it directly affects what your model can learn and whether it's cheating by seeing information from the future.
Creating windows from daily data
With a window size of 60 days, you create overlapping input-target pairs:
Window 1: days 1-60 → predict day 61
Window 2: days 2-61 → predict day 62
Window 3: days 3-62 → predict day 63
...
Window N: days (end-60) to (end-1) → predict day end
For 20 years of daily data (~5,000 trading days):
You get approximately 4,940 training windows (5,000 − 60)
Each window is a separate training example, but not a statistically independent one: adjacent windows share 59 of their 60 days. The model sees window 1 and is told "the answer is day 61's return." It sees window 2 and is told "the answer is day 62's return." After thousands of windows, it learns to map 60-day patterns to next-day outcomes.
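The windowing scheme above can be sketched as follows. This is a minimal version assuming a single univariate series of returns; the function name `make_windows` and the synthetic data are illustrative, and a real pipeline would carry a feature matrix per day.

```python
import numpy as np

def make_windows(returns, window=60):
    """Pair each `window`-day slice with the value on the following day."""
    returns = np.asarray(returns, dtype=float)
    n = len(returns) - window                  # number of (input, target) pairs
    if n <= 0:
        raise ValueError("series is shorter than window + 1")
    X = np.stack([returns[i:i + window] for i in range(n)])   # (n, window)
    y = returns[window:]                                      # (n,) next-day targets
    return X, y

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=5000)       # synthetic stand-in for ~20 years of daily returns
X, y = make_windows(returns, window=60)
```

Every target in `y` lies strictly after the last day of its input window, which is the property to re-check whenever features are added.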
What the model learns from a prediction
Each time the model makes a prediction, it compares it to the real outcome and adjusts its internal weights (via gradient descent, typically with an optimizer such as Adam) to reduce the error. Repeat this for all 4,940 windows across multiple epochs, and the model's weights encode whatever stable relationship exists between 60-day history and next-day outcome.
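If no stable relationship exists, or if the features you've given it don't capture one, the model defaults to predicting something close to the mean return. Under a squared-error loss, that constant is the lowest-error answer available when nothing else is predictable. This is called mean collapse, and it's the first major failure mode we encountered.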
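To see why the mean is the fallback, the sketch below searches for the best constant prediction on signal-free targets. It assumes a squared-error loss (the text does not name the loss used) and uses synthetic returns; the final ratio is one simple, hypothetical collapse diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0005, scale=0.01, size=4940)   # synthetic next-day returns with no signal

def mse(constant):
    """Mean squared error of predicting the same constant every day."""
    return np.mean((y - constant) ** 2)

candidates = np.linspace(-0.01, 0.01, 2001)          # constant predictions to try
best = candidates[np.argmin([mse(c) for c in candidates])]

# A collapsed model outputs (nearly) the same number every day, so the spread
# of its predictions is tiny compared with the spread of the targets.
preds = np.full_like(y, best)
collapse_ratio = preds.std() / y.std()
```

The best constant lands on the grid point nearest `y.mean()`. Tracking the ratio of prediction spread to target spread during training is a cheap way to notice collapse early.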
Positional Encoding: Teaching the Model About Time Order
Transformers process sequences in parallel — they don't inherently know which element came first. To fix this, we inject positional information.
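The sinusoidal encoding from the original transformer paper, which the definitions below match, is:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
where:
pos = position in sequence (0 to 59 for a 60-day window)
i = dimension index (0 to d_model/2 − 1; each i fills one sine and one cosine column)
d_model = embedding dimension (e.g. 64, 128, 256)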
In practice you rarely implement this from scratch: reference implementations are widely available in the PyTorch and Hugging Face ecosystems. But understanding what it does matters: without it, the attention layers cannot tell "prices on day 1" from "prices on day 60" by position, so the order of the days carries no information for the model.
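For reference, here is a minimal NumPy sketch of the sinusoidal scheme from the original transformer paper. It is assumed here because the `pos`, `i`, and `d_model` definitions above match it; a learned positional embedding is a common alternative that this sketch does not cover.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position codes."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # one frequency per column pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even columns
    pe[:, 1::2] = np.cos(angles)                       # odd columns
    return pe

pe = sinusoidal_positional_encoding(seq_len=60, d_model=64)
# Typical use: add `pe` to the embedded window before the first attention layer.
```

Each row is a distinct fingerprint for one day in the window, so after adding it, day 1 and day 60 no longer look alike to the attention layers.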
When to Use Which Architecture
| Scenario | Recommended | Reason |
|---|---|---|
| Quick prototype, limited compute | Standard Transformer or LightGBM | Fast iteration, interpretable failure modes |
| Multi-horizon financial prediction | TFT | Built for this — gating, LSTM, multi-head |
| Generative scenario simulation | Temporal Diffusion | Produces distributions, not point estimates |
| Tabular cross-sectional ranking | LightGBM / XGBoost | Often outperforms deep learning on structured data |
| Noisy data, unknown regime | TFT with gating | Gating suppresses irrelevant features dynamically |
Summary: What to Take Into Part 2
- A temporal transformer maps a window of past observations directly to a prediction, using attention to identify relevant history.
- A temporal diffusion model starts from noise and iteratively refines toward a prediction — more compute-intensive, produces uncertainty estimates.
- The Temporal Fusion Transformer (TFT) is a distinct architecture combining variable selection gating, LSTM temporal encoding, attention, and multi-horizon output heads.
- All architectures need data formatted as sliding windows with proper train/validation splits.
- Architecture choice matters because each has different inductive biases — capabilities and blind spots that interact with the statistical properties of your data.
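On the "proper train/validation splits" point: with overlapping windows, a random shuffle leaks future days into training. One hedged way to avoid that is a chronological split with a gap. The function name, the 20% validation fraction, and the 60-window gap below are illustrative assumptions, not settings taken from this series.

```python
import numpy as np

def chronological_split(n_windows, val_fraction=0.2, gap=60):
    """Return (train_idx, val_idx) with validation strictly later in time.

    `gap` windows between the two sets are dropped. With gap >= window size,
    no validation input shares a day with any training input or target.
    """
    n_val = int(n_windows * val_fraction)
    train_end = n_windows - n_val - gap
    if train_end <= 0 or n_val <= 0:
        raise ValueError("not enough windows for this split")
    train_idx = np.arange(0, train_end)                 # earliest windows
    val_idx = np.arange(n_windows - n_val, n_windows)   # latest windows
    return train_idx, val_idx

train_idx, val_idx = chronological_split(4940, val_fraction=0.2, gap=60)
```

Window `j` covers days `j` to `j + 59` and targets day `j + 60`, so the last training window must end at least 60 indices before the first validation window.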
In Part 2, we build the data pipeline that feeds these architectures.
Series: Building a Quantitative Trading System. Part 01: Time-Series Foundations: Transformers, Diffusion, and Why They're Different