The Core Question

When you have 20 years of daily price data, news sentiment, and hourly options chains, the central problem isn't collecting the data. It's teaching a model to find non-random relationships between the past and the future — relationships that are stable enough to trade.

Two architectural families dominate modern sequence modeling: transformers and diffusion models. Both can be applied to time series. They are not the same thing, and confusing them leads to architectural mistakes that are expensive to debug later.

Transformers: Direct Sequence-to-Prediction

A transformer is a neural network designed to model relationships between elements in a sequence. The key mechanism is attention — the model learns which past time steps are relevant to predicting the current or future time step.

Temporal Transformer

A transformer architecture where the sequence dimension represents time. Given a window of T past observations, the model outputs a prediction for time step T+1 (or multiple future steps). The attention mechanism allows each position to "look at" any other position in the window, learning which past patterns matter.

The standard transformer prediction path is direct:

Input: 60 days of [price, volume, sentiment, ...] features

↓

Positional encoding (inject temporal order)

↓

Multi-head self-attention (which days matter?)

↓

Feed-forward layers (non-linear transformation)

↓

Output: Predicted return for day 61

This is a one-shot prediction. You pass in the sequence and get a prediction back in a single forward pass. The model doesn't refine its answer — it commits immediately.

What attention actually computes

The attention mechanism computes a weighted sum of values from across the window. The weights are not fixed parameters; they are computed for each input from learned query and key projections. For each position in the sequence, it asks: "how much should I attend to each other position?" In a 60-day window, a high attention weight on day 45 means "the features from 15 days ago matter a lot for this prediction." A low weight means "ignore it."

This is powerful for markets because relevant patterns are not always recent. A 200-day moving average cross-over from 60 days ago might matter more than yesterday's price. Attention can learn this.
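To make the weighted sum concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention over one 60-day window. The feature count, projection size, and random weights are illustrative assumptions; in a trained model the three projection matrices would be learned.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one window.

    x: (T, d_in) features for T days; w_q, w_k, w_v: (d_in, d_k) projections.
    Returns (output, weights) with shapes (T, d_k) and (T, T).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (T, T): similarity of each day to every other day
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row becomes a distribution over days
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d_in, d_k = 60, 8, 16                            # assumed sizes, for illustration only
x = rng.normal(size=(T, d_in))                      # random stand-in for real features
w_q, w_k, w_v = (rng.normal(size=(d_in, d_k)) * 0.1 for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
# attn[59, 44] is how much the last day (index 59) attends to day 45 (index 44)
```

Each row of `attn` sums to 1, so a row can be read as how one day spreads its attention across the whole window.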

Diffusion Models: Prediction Through Iterative Denoising

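Diffusion models come from generative AI: image synthesis, audio generation. The core idea differs from direct prediction in a fundamental way. Diffusion describes how an output is generated, and the network that does the denoising can itself be a transformer.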

Diffusion Model (forward process)

Start with clean data. Progressively add Gaussian noise over N steps until the data is indistinguishable from pure noise. The model learns to reverse this process, going from noise back to data.
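As a rough illustration of the forward process, the sketch below uses the closed-form noising step from DDPM-style diffusion. The linear noise schedule, the step count, and the toy "returns" vector are assumptions for illustration, not settings taken from this project.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample the noised version of x0 at step t in closed form (DDPM-style)."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

n_steps = 1000                                  # assumed number of noising steps
betas = np.linspace(1e-4, 0.02, n_steps)        # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)             # share of the original signal kept at each step

rng = np.random.default_rng(0)
x0 = rng.normal(scale=0.01, size=5)             # stand-in for five future daily returns
x_early, _ = forward_noise(x0, 0, alpha_bar, rng)
x_late, _ = forward_noise(x0, n_steps - 1, alpha_bar, rng)
```

Early steps leave the data almost untouched (`alpha_bar[0]` is close to 1), while by the last step almost none of the original signal remains. During training, the network is typically asked to predict `eps` from the noised input.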

Temporal Diffusion Transformer

A model that applies diffusion-based denoising to time-evolving sequences. Instead of predicting a target directly, it starts from a noisy estimate and iteratively refines it over multiple denoising steps, using a transformer to model the temporal context at each step.

The prediction path for a diffusion model looks like this:

Input: 60 days of context + random noise (the "noisy prediction")

↓

Step 1: Reduce noise slightly (model denoises)

↓

Step 2: Reduce noise again

↓

... (N denoising steps)

↓

Final: Clean prediction emerges from the noise
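The denoising path above can be sketched as a loop. This is a DDPM-style sampler under assumed settings (50 steps, a linear schedule). The `oracle_denoiser` is a hypothetical stand-in that already knows the answer, used here only so the loop can run without a trained network; a real model would estimate the noise from the 60-day context instead.

```python
import numpy as np

def sample(denoiser, context, shape, betas, rng):
    """DDPM-style reverse process: start from noise, denoise len(betas) times."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.normal(size=shape)                      # the "noisy prediction"
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = denoiser(x, t, context)           # one network forward pass per step
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                                   # no fresh noise on the final step
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)
    return x

betas = np.linspace(1e-4, 0.2, 50)                  # assumed schedule, 50 denoising steps
alpha_bar = np.cumprod(1.0 - betas)
target = np.array([0.004])                          # pretend "true" next-day return
calls = {"n": 0}

def oracle_denoiser(x, t, context):
    """Hypothetical stand-in for a trained network: it already knows the target."""
    calls["n"] += 1
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

context = np.zeros((60, 8))                         # placeholder 60-day feature window
pred = sample(oracle_denoiser, context, target.shape, betas, np.random.default_rng(0))
```

Note that `calls["n"]` ends at 50: one prediction cost 50 denoiser calls, where a direct model would need a single forward pass.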

Why diffusion is computationally heavier

A standard transformer does one forward pass per prediction. A diffusion model does N forward passes (one per denoising step). For financial time series, N is typically 50–1000. This explains the observation in this project where the TFT/diffusion-style model ran at 70% GPU utilization on a 128GB GB10 system for many hours, while a standard transformer finished in minutes.

More compute is not always better. The question is whether the multi-step refinement produces meaningfully better predictions, and whether your compute budget allows it.

Important distinction

The Temporal Fusion Transformer (TFT) in this series is not a diffusion model. Despite the naming similarity, TFT is a deterministic sequence model with gating mechanisms, LSTM encoding, and attention — not iterative denoising. The "temporal diffusion transformer" is a separate research direction. We ran both. The TFT performed significantly better on this dataset.

Side-by-Side Comparison

Dimension	Standard Transformer (ST)	Diffusion-based Model	TFT
Prediction type	Direct, one-shot	Iterative denoising	Direct, multi-horizon
Compute per prediction	Low (1 forward pass)	High (N forward passes)	Medium
Temporal memory	Attention only	Attention + denoising	LSTM + attention
Feature selection	Implicit via attention	Implicit via attention	Explicit gating layer
Uncertainty modeling	No (point estimate)	Yes (stochastic)	Optional (quantile heads)
Good for finance	Medium — prone to collapse	Experimental	Yes — robust to noise

The Problem with "Data is Just Data"

At one point in building this system, the question came up: "if I'm just passing normalized numbers, does the architecture even matter? It's all just data."

This is a reasonable intuition that turns out to be wrong in an important way. The architecture is not just a container for data — it defines the inductive biases of what the model can and cannot learn efficiently.

What a standard transformer can't easily learn

Temporal persistence. If volatility was high yesterday and the day before, it will probably be elevated tomorrow. A pure attention model doesn't have a built-in mechanism to carry this "state" through time — it must learn it implicitly from weights, which is harder.
Context-dependent feature relevance. IV/HV ratio matters enormously during earnings. RSI is nearly useless during regime transitions. A standard transformer treats features the same way across all contexts unless explicitly taught otherwise.
Avoiding collapse. Without mechanisms that force the model to produce diverse outputs, transformers on noisy financial data tend to collapse toward predicting the mean of the training distribution. This is exactly what happened in our experiments.

What the TFT adds

Gated variable selection — learns which features matter at each time step
LSTM encoder — explicitly models temporal state and persistence
Gated residuals — stabilizes deep networks on noisy data
Multi-horizon output heads — predicts 5-day, 10-day, 20-day returns simultaneously

None of these are magic. Each is an engineering decision that addresses a specific failure mode of naive transformers on financial data. We'll examine each in detail in Part 5.
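Ahead of that, here is a deliberately simplified sketch of the two gating ideas. It is not the TFT's actual layer definition (which uses gated residual networks with layer normalization); the function names, shapes, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def variable_selection(x, w, b):
    """Simplified per-time-step feature gating.

    x: (T, F) window of F features; w: (F, F) and b: (F,) would be learned.
    Returns gated features (T, F) and selection weights (T, F).
    """
    weights = softmax(x @ w + b, axis=-1)           # one weight per feature, per day
    return x * weights, weights

def gated_residual(x, h, w_g, b_g):
    """Simplified gated skip connection: the gate decides how much of h to add to x."""
    gate = 1.0 / (1.0 + np.exp(-(h @ w_g + b_g)))   # sigmoid, values in (0, 1)
    return x + gate * h

rng = np.random.default_rng(0)
T, F = 60, 5                                        # assumed window length and feature count
x = rng.normal(size=(T, F))
gated, sel = variable_selection(x, rng.normal(size=(F, F)) * 0.1, np.zeros(F))
```

Because the selection weights depend on the inputs for each day, the same feature can be emphasized in one context and suppressed in another. When the gate in `gated_residual` is near zero, the layer passes `x` through almost unchanged.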

The Sliding Window: How Sequences Become Training Data

Regardless of architecture, all sequence models need their data formatted as windows. This is worth understanding precisely, because it directly affects what your model can learn and whether it is cheating by seeing information from the future.

Creating windows from daily data

With a window size of 60 days, you create overlapping input-target pairs:

Window 1: days 1-60 → predict day 61

Window 2: days 2-61 → predict day 62

Window 3: days 3-62 → predict day 63

...

Window N: days (end-60) to (end-1) → predict end

For 20 years of daily data (~5,000 trading days):

You get approximately 4,940 training windows (5,000 days minus the 60-day window)

Each window is a separate training example, though not a statistically independent one: adjacent windows share 59 of their 60 days. The model sees window 1 and is told "the answer is day 61's return." It sees window 2 and is told "the answer is day 62's return." After thousands of windows, it learns to map 60-day patterns to next-day outcomes.
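A minimal sketch of the windowing step, assuming the features arrive as a `(n_days, n_features)` array and the target is a per-day return series. The function name `make_windows` and the random data are illustrative.

```python
import numpy as np

def make_windows(features, returns, window=60):
    """Build overlapping (input, target) pairs.

    features: (n_days, n_features); returns: (n_days,).
    Window i covers rows i .. i+window-1; its target is the return at row i+window.
    """
    n_days = len(features)
    X = np.stack([features[i:i + window] for i in range(n_days - window)])
    y = returns[window:]
    return X, y

rng = np.random.default_rng(0)
n_days, n_features = 5000, 3                    # roughly 20 years of daily data
features = rng.normal(size=(n_days, n_features))
returns = rng.normal(scale=0.01, size=n_days)   # synthetic returns, not market data
X, y = make_windows(features, returns)          # X: (4940, 60, 3), y: (4940,)
```

Every input row comes strictly before its target day, and consecutive windows overlap in all but one row.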

Critical

The model does not use its prediction from window 1 as input to window 2. Each window uses only real historical data. The "sliding" is about shifting the window forward in time, not about feeding predictions back into inputs.

What the model learns from a prediction

During training, each time the model makes a prediction, the prediction is compared to the real outcome and the model's internal weights are adjusted (via gradient descent, typically with an optimizer such as Adam) to reduce the error. Repeat this for all 4,940 windows across multiple epochs, and the model's weights encode whatever stable relationship exists between 60-day history and next-day outcome.
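The update loop can be sketched with the simplest possible model. This example assumes a linear model on flattened windows, mean squared error, and plain full-batch gradient descent on synthetic data; it stands in for the real network and optimizer, which are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, window, n_feat = 500, 60, 3
X = rng.normal(size=(n, window, n_feat))              # synthetic windows, not market data
true_w = np.array([0.5, -0.3, 0.2])
y = X[:, -1, :] @ true_w + 0.1 * rng.normal(size=n)   # target depends on the last day only

Xf = X.reshape(n, -1)                                 # flatten each window to one vector
w = np.zeros(Xf.shape[1])                             # model weights, start at zero
lr, losses = 0.01, []
for epoch in range(200):
    err = Xf @ w - y                                  # prediction minus real outcome
    losses.append(float(np.mean(err ** 2)))           # mean squared error this epoch
    grad = 2.0 * Xf.T @ err / n                       # gradient of the MSE w.r.t. w
    w -= lr * grad                                    # one gradient-descent update
```

Here the loss falls because the synthetic target really does depend on the inputs. When it does not, the same loop has nothing to fit except the average.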

If no stable relationship exists — or if the features you've given it don't capture it — the model defaults to predicting the mean return. This is called mean collapse, and it's the first major failure mode we encountered.
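One simple way to check for this, offered as a sketch rather than the diagnostic used in this project, is to compare the spread of the predictions with the spread of the targets. The helper name `collapse_ratio` and the synthetic returns are assumptions.

```python
import numpy as np

def collapse_ratio(preds, targets):
    """Std of predictions divided by std of targets; near 0 suggests mean collapse."""
    return float(np.std(preds) / np.std(targets))

rng = np.random.default_rng(0)
targets = rng.normal(loc=0.0005, scale=0.01, size=1000)   # synthetic daily returns
collapsed = np.full_like(targets, targets.mean())          # a model that always predicts the mean
ratio = collapse_ratio(collapsed, targets)                 # essentially zero

# The mean predictor's squared error equals the variance of the targets,
# which makes it a natural baseline for any model to beat.
baseline_mse = float(np.mean((collapsed - targets) ** 2))
```

A ratio near zero on validation data, combined with an error close to `baseline_mse`, is a strong hint that the model has learned little beyond the average return.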

Positional Encoding: Teaching the Model About Time Order

Transformers process sequences in parallel — they don't inherently know which element came first. To fix this, we inject positional information.

Positional Encoding

A vector added to each input embedding that encodes its position in the sequence. The classic approach uses sine and cosine functions at different frequencies so that each position gets a unique encoding from which the model can learn order. The model can then differentiate "this is 60 days ago" from "this is yesterday."

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos = position in sequence (0 to 59 for a 60-day window)

i = dimension index

d_model = embedding dimension (e.g. 64, 128, 256)

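In practice you rarely write this from scratch: model implementations in libraries such as HuggingFace Transformers handle positional encoding internally, and short reference implementations for PyTorch are widely available. But understanding what it does matters: self-attention by itself is order-agnostic, so without positional information the model cannot tell "prices on day 1" from "prices on day 60."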
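For reference, the two formulas translate into a few lines of NumPy. This sketch assumes an even `d_model`; the function name is illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding, shape (seq_len, d_model); d_model must be even."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: PE(pos, 2i+1)
    return pe

pe = positional_encoding(60, 64)                       # one row per day in a 60-day window
```

The resulting matrix is added to the embedded inputs before the first attention layer, so each day's representation carries its position with it.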

When to Use Which Architecture

Scenario	Recommended	Reason
Quick prototype, limited compute	Standard Transformer or LightGBM	Fast iteration, interpretable failure modes
Multi-horizon financial prediction	TFT	Built for this: gating, LSTM, multi-horizon output heads
Generative scenario simulation	Temporal Diffusion	Produces distributions, not point estimates
Tabular cross-sectional ranking	LightGBM / XGBoost	Outperforms deep learning on structured data
Noisy data, unknown regime	TFT with gating	Gating suppresses irrelevant features dynamically

Practical finding from this project

The architecture that worked was TFT — but the reason was not that it's a better algorithm in the abstract. It's that feature gating + LSTM temporal encoding specifically addressed the failure modes of the Standard Transformer on this dataset. The right architecture depends on your data's statistical properties.

Summary: What to Take Into Part 2

A temporal transformer maps a window of past observations directly to a prediction, using attention to identify relevant history.
A temporal diffusion model starts from noise and iteratively refines toward a prediction — more compute-intensive, produces uncertainty estimates.
The Temporal Fusion Transformer (TFT) is a distinct architecture combining variable selection gating, LSTM temporal encoding, attention, and multi-horizon output heads.
All architectures need data formatted as sliding windows with proper train/validation splits.
Architecture choice matters because each has different inductive biases — capabilities and blind spots that interact with the statistical properties of your data.

In Part 2, we build the data pipeline that feeds these architectures.

Series — Building a Quantitative Trading System

Time-Series Foundations: Transformers, Diffusion, and Why They're Different