LDTM v2 — GPU Performance Analysis & Optimization Guide

LDTM v2 performance analysis: walk-forward returns, Sharpe ratios, maximum drawdown, and regime-conditional signal quality metrics.

January 18, 2026·9 min read

Model: Long-Duration Temporal Model (LDTM) v2
Date: 2026-04-21
Hardware: NVIDIA GB10 (128GB unified memory, 1 GPU)
Measured: 2026-04-21 production run, 103 tickers


1. Observed Performance Metrics

1.1 Training Time Per Ticker (2026-04-21 Production Run)

| Category | Tickers | Avg Duration | Examples |
|---|---|---|---|
| Ultra-fast (< 10s) | 3 | ~4.7s | FER (4s), GEHC (5s), CSGP (5s) |
| Fast (10–30s) | 38 | ~23s | EA (26s), GOOG (25s), SNPS (30s) |
| Medium (30–60s) | 40 | ~44s | NVDA (100s wait/4 slots), TSLA (97s) |
| Slow (60–130s) | 22 | ~85s | MSTR (130s), TRI (113s), ADP (86s) |
| All 103 tickers | 103 | ~52s avg | Wall clock: ~20 min (4 slots) |

Note: Ultra-fast tickers (FER, GEHC) have limited history (< 3 years since IPO or listing). Training stops early at epoch 2 because the model overfits these tiny datasets.
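To make the relationship between per-ticker durations and wall-clock time concrete, here is a minimal sketch that simulates a fixed pool of training slots. The function name and the choice of nine sample tickers are illustrative assumptions; the actual scheduler used for the production run is not described in this post.

```python
import heapq

def wall_clock_seconds(durations, slots):
    """Estimate wall-clock time when jobs run on a fixed pool of slots.

    Jobs are taken in the given order; each one starts on whichever
    slot becomes free first.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    free_at = [0.0] * slots            # time at which each slot becomes idle
    heapq.heapify(free_at)
    for seconds in durations:
        start = heapq.heappop(free_at)             # earliest-free slot
        heapq.heappush(free_at, start + seconds)   # slot is busy until then
    return max(free_at)

# Example durations (seconds) taken from the table above.
sample = {"FER": 4, "GEHC": 5, "CSGP": 5, "EA": 26, "GOOG": 25,
          "SNPS": 30, "MSTR": 130, "TRI": 113, "ADP": 86}

in_listed_order = wall_clock_seconds(list(sample.values()), slots=4)
longest_first = wall_clock_seconds(sorted(sample.values(), reverse=True), slots=4)
print(in_listed_order, longest_first)
```

With only four slots, the order in which slow tickers are launched changes the total: starting the longest jobs first keeps a slow ticker such as MSTR from landing at the end of the queue.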

1.2 Inference Time Per Ticker

| Step | Duration |
|---|---|
| Checkpoint load (CPU) | ~50ms |
| DB query + feature computation | ~80ms |
| GPU forward pass | ~2ms |
| DB write (ldtm_run_log) | ~10ms |
| Total | ~142ms |

All 103 inferences complete in ~2 seconds wall clock with 4 concurrent slots.

1.3 GPU Utilization During Training

GPU 0 (NVIDIA GB10) — observed during 4-parallel-slot training:
  GPU-Util: 90%
  Memory per LDTM container: ~310 MiB

Competing processes:
  Triton/Mistral-7B FP8: ~39,167 MiB
  Triton MPI worker:      ~240 MiB
  Desktop (Xorg/GNOME):   ~570 MiB
  
Available VRAM for LDTM:  ~88 GB (128 GB total - 39.5 GB Triton - 570 MB desktop)
LDTM actual consumption:   4 × 310 MiB = 1,240 MiB (1.2 GB)
LDTM VRAM utilization:    1.2 / 88 = 1.4% of available VRAM

The LDTM model is compute-bound, not memory-bound. The 310 MiB per container is almost entirely CUDA context overhead (~250 MiB) plus the Python/PyTorch runtime (~55 MiB); model parameters, optimizer state, and batch buffers together account for only a few MiB.

1.4 Memory Breakdown Per Container

| Component | Size |
|---|---|
| CUDA context (runtime init) | ~250 MiB |
| Model weights (FP32, 227K params) | 0.87 MiB |
| Input batch (32 × 30 × 11 × FP16) | ~0.02 MiB |
| Optimizer state (AdamW: 2× params) | 1.74 MiB |
| Gradient buffers | 0.87 MiB |
| Python + PyTorch runtime | ~55 MiB |
| Total | ~308 MiB |

The CUDA context dominates. This is a fixed cost paid by every Python process that calls any CUDA API, regardless of model size. This means:

  • Running 100 containers uses ~25 GB just for CUDA contexts
  • The model itself is trivially small
  • AMP saves essentially nothing on memory for LDTM — the model is too small
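The breakdown above can be reproduced with a few lines of arithmetic. This is a sketch only: the 250 MiB and 55 MiB figures are the measured values from the table, and the function name is an illustrative assumption.

```python
MIB = 1024 ** 2

def container_memory_mib(n_params, batch_elems, cuda_context=250.0, runtime=55.0):
    """Return a per-component memory estimate (MiB) for one training container."""
    weights = n_params * 4 / MIB               # FP32: 4 bytes per parameter
    return {
        "cuda_context": cuda_context,          # fixed cost, measured value from the table
        "runtime": runtime,                    # Python + PyTorch, measured value
        "weights": weights,
        "gradients": weights,                  # one FP32 gradient per parameter
        "adamw_state": 2 * weights,            # two moment estimates per parameter
        "input_batch": batch_elems * 2 / MIB,  # FP16: 2 bytes per element
    }

parts = container_memory_mib(n_params=227_000, batch_elems=32 * 30 * 11)
total = sum(parts.values())
model_related = total - parts["cuda_context"] - parts["runtime"]
print(f"total ≈ {total:.0f} MiB, model-related ≈ {model_related:.2f} MiB")
```

Everything that scales with the model (weights, gradients, optimizer state, input batch) comes to under 4 MiB, which is why halving precision has no visible effect on the per-container footprint.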

2. Current Performance Profile

2.1 AMP (Automatic Mixed Precision)

LDTM v2 already uses AMP:

with torch.amp.autocast(device.type, enabled=use_amp):
    preds = model(x_batch)
    loss = ...

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Observed impact on LDTM:

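  • FP16 throughput gain: ~1.2x (vs theoretical 2x)
  • Reason: an LSTM with hidden_size=128 produces matrix multiplies that are too small to saturate the tensor cores, so kernel-launch overhead and memory latency dominate. Alignment is not the problem: 128 is already a multiple of 8, the alignment NVIDIA recommends for FP16
  • Tensor-core utilization improves at hidden_size ≥ 256 or batch_size ≥ 64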
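One way to see how little work each LSTM step gives the GPU is to count the FLOPs in the gate matrix multiplies. The sketch below assumes a standard LSTM cell and the 11 input features implied by the 32 × 30 × 11 batch shape; the function name is hypothetical, and it covers the first layer only.

```python
def lstm_step_gemm_flops(batch_size, input_size, hidden_size):
    """Approximate FLOPs for the gate matmuls of one LSTM layer at one timestep.

    The four gates together multiply a (batch, input + hidden) matrix by an
    (input + hidden, 4 * hidden) weight matrix. Each multiply-add counts as
    two FLOPs; elementwise gate math is ignored.
    """
    return 2 * batch_size * (input_size + hidden_size) * 4 * hidden_size

current = lstm_step_gemm_flops(32, 11, 128)       # current configuration
wider = lstm_step_gemm_flops(32, 11, 256)         # hidden_size doubled
bigger_batch = lstm_step_gemm_flops(64, 11, 128)  # batch_size doubled

print(f"current: {current:,}  wider: {wider:,}  bigger batch: {bigger_batch:,}")
```

A few million FLOPs per timestep is a small amount of work per kernel launch. Doubling hidden_size roughly quadruples it, while doubling batch_size only doubles it.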

2.2 Batch Size Impact

| batch_size | Training Time (AAPL) | val_loss | Notes |
|---|---|---|---|
| 16 | 43s | 0.581 | Noisy gradients, but stable |
| 32 (current) | 31s | 0.578 | Sweet spot |
| 64 | 27s | 0.583 | Slightly worse generalization |
| 128 | 25s | 0.590 | Overfitting on small datasets |

The default batch_size=32 is already near-optimal. Increasing to 64 reduces training time by ~13% (31s → 27s) but slightly degrades generalization (val_loss 0.578 → 0.583).


3. Optimization Opportunities

3.1 Increase Hidden Size (Most Impactful)

The current hidden_size=128 underutilizes GPU tensor cores. The limitation is size rather than alignment: 128 already meets NVIDIA's recommended alignment (matrix dimensions that are multiples of 8 for FP16), but the resulting matrix multiplies are too small to keep the tensor cores busy.

Recommendation: hidden_size=256

# In config.py, change default:
hidden_size: int = 256   # was 128

| hidden_size | Params | VRAM/container | Training time | Expected val_loss |
|---|---|---|---|---|
| 128 (current) | 227K | 310 MiB | 31s | 0.578 |
| 256 | 787K | 315 MiB | ~38s | ~0.52–0.55 |
| 512 | 2.8M | 325 MiB | ~52s | ~0.49–0.52 |

At hidden_size=256:

  • Parameter count increases 3.5× but VRAM barely changes (model weights are tiny vs CUDA context)
  • Larger matrix multiplies keep the tensor cores busier → FP16 speedup is expected to improve from 1.2× to ~1.8×
  • Expected val_loss improvement: ~5–10% relative
  • Training time increases ~23% (31s → ~38s) due to larger weight matrices

Warning: Changing hidden_size invalidates all existing checkpoints, so all 103 tickers must be retrained.

3.2 Increase num_layers to 3

num_layers: int = 3   # was 2

| num_layers | Params | Training time | Expected val_loss |
|---|---|---|---|
| 2 (current) | 227K | 31s | 0.578 |
| 3 | 358K | ~38s | ~0.55–0.57 |
| 4 | 489K | ~44s | ~0.53–0.56 |

3-layer LSTMs can capture longer-range dependencies (quarterly earnings patterns, seasonal trends) that 2-layer models may miss. Combined with hidden_size=256, a 3-layer model is expected to offer the best performance/complexity tradeoff.
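The per-layer parameter step in the table can be sanity-checked against the standard LSTM parameter formula. This sketch assumes a stock unidirectional LSTM stack (the layout used by torch.nn.LSTM) with 11 input features, as implied by the 32 × 30 × 11 batch shape. The output head and any other LDTM-specific layers are not described in this post, so they are not counted and the totals come out below the table's figures.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Count parameters in a stacked unidirectional LSTM (torch.nn.LSTM layout)."""
    total = 0
    for layer in range(num_layers):
        in_features = input_size if layer == 0 else hidden_size
        # Input-to-hidden and hidden-to-hidden weights for the four gates.
        total += 4 * hidden_size * (in_features + hidden_size)
        # Two bias vectors (b_ih and b_hh), each of length 4 * hidden_size.
        total += 2 * 4 * hidden_size
    return total

two_layers = lstm_param_count(11, 128, 2)
three_layers = lstm_param_count(11, 128, 3)
print(f"2 layers: {two_layers:,}  3 layers: {three_layers:,}  "
      f"added per layer: {three_layers - two_layers:,}")
```

Each extra layer at hidden_size=128 adds about 132K parameters, in line with the roughly 131K step between rows of the table (227K → 358K → 489K).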

Recommended config upgrade:

hidden_size: int = 256
num_layers:  int = 3
dropout:     float = 0.3   # increase slightly for deeper model

3.3 Window Size Increase

window_size: int = 60   # was 30; 60 trading days ≈ one quarter

| window_size | Training samples (AAPL) | Training time | Expected improvement |
|---|---|---|---|
| 30 (current) | ~4,800 | 31s | baseline |
| 60 | ~4,800 (same n) | ~52s | +5% on 1-month horizon |
| 90 | ~4,800 | ~78s | +3% on 1-month, -2% on next-day |
A 60-day window spans roughly one quarterly earnings cycle (about 63 trading days) and gives the model context on earnings momentum, which is highly predictive for 1-month forecasts. Returns diminish beyond 60 days because macro regimes shift faster than the LSTM can adapt.
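The sample count stays near 4,800 in the table because a longer window only removes a handful of samples from the start of the series. A minimal sliding-window sketch shows this; the function name and the horizon parameter are illustrative assumptions, not the LDTM data pipeline.

```python
def make_windows(series, window_size, horizon=1):
    """Build (window, target) pairs from a 1-D series.

    Each window holds `window_size` consecutive values; the target is the
    value `horizon` steps after the window's last element.
    """
    samples = []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        window = series[start:start + window_size]
        target = series[start + window_size + horizon - 1]
        samples.append((window, target))
    return samples

prices = list(range(4_830))   # stand-in for roughly 19 years of daily closes
n_30 = len(make_windows(prices, window_size=30))
n_60 = len(make_windows(prices, window_size=60))
print(n_30, n_60)             # the two counts differ by exactly 30 samples
```

Training time grows with window_size because every sample is a longer sequence for the LSTM to unroll, not because there are more samples.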

3.4 Torch Compile

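# In trainer.py, after model initialization:
# Feature-detect instead of comparing version strings ("10.0" < "2.0" as text).
if hasattr(torch, "compile"):   # torch.compile exists from PyTorch 2.0 onward
    model = torch.compile(model, mode="reduce-overhead")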

torch.compile with mode="reduce-overhead" reduces Python dispatch and kernel-launch overhead (this mode uses CUDA graphs):

  • First-run cost: ~60s for compilation (one-time per training session)
  • Subsequent speedup: ~15–25% per epoch
  • Works well for LSTM at hidden_size ≥ 256

Not recommended at hidden_size=128 — compilation overhead exceeds the gains on this tiny model.
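A quick break-even estimate shows why the compile cost matters for short runs. The sketch assumes the "15–25% per epoch" speedup means that fraction of training time is saved, and that the ~60s compile cost is paid once per training process. Both function names are hypothetical.

```python
def break_even_seconds(compile_cost_s, time_saved_fraction):
    """Uncompiled training time above which a one-time compile cost is recovered."""
    if not 0 < time_saved_fraction < 1:
        raise ValueError("time_saved_fraction must be between 0 and 1")
    return compile_cost_s / time_saved_fraction

def net_saving_seconds(train_s, compile_cost_s, time_saved_fraction):
    """Seconds saved (negative means slower overall) for one training run."""
    return train_s * time_saved_fraction - compile_cost_s

best_case = break_even_seconds(60, 0.25)     # optimistic end of the range
worst_case = break_even_seconds(60, 0.15)    # pessimistic end of the range
aapl_today = net_saving_seconds(31, 60, 0.25)
print(best_case, worst_case, aapl_today)
```

Under these assumptions a run needs several minutes of uncompiled training time before compilation pays for itself, so the benefit depends on how many tickers share one compiled process.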

3.5 Increased Batch Size with Gradient Accumulation

For GPU memory efficiency when using large batches:

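# Effective batch = batch_size × accumulation_steps = 32 × 4 = 128
accumulation_steps = 4
num_batches = len(train_loader)
optimizer.zero_grad()
for i, (x_batch, y_batch) in enumerate(train_loader):
    with torch.amp.autocast(device.type, enabled=use_amp):
        preds = model(x_batch)
        # loss_fn is a placeholder for the training loss, which this post does not show
        loss = loss_fn(preds, y_batch) / accumulation_steps
    scaler.scale(loss).backward()   # gradients accumulate across micro-batches
    # Step every accumulation_steps batches, and on the final batch so leftovers are applied
    if (i + 1) % accumulation_steps == 0 or (i + 1) == num_batches:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()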

This simulates batch_size=128 without 4× the VRAM. On LDTM (where VRAM is not the constraint), this mainly improves gradient quality on small-dataset tickers (< 500 samples, e.g. recent IPOs).
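The reason for dividing the loss by accumulation_steps can be checked on a toy problem. The sketch below uses a one-parameter linear model with mean-squared error, written in plain Python with made-up data. It is not LDTM code, only a demonstration that accumulated micro-batch gradients match the full-batch gradient when micro-batches are equally sized.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / number_of_steps."""
    if len(xs) % micro_batch:
        raise ValueError("this example assumes equally sized micro-batches")
    steps = len(xs) // micro_batch
    total = 0.0
    for k in range(steps):
        chunk = slice(k * micro_batch, (k + 1) * micro_batch)
        # Same effect as computing (loss / accumulation_steps).backward()
        total += mse_grad(w, xs[chunk], ys[chunk]) / steps
    return total

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.8]
full = mse_grad(0.3, xs, ys)
accum = accumulated_grad(0.3, xs, ys, micro_batch=2)
print(full, accum)   # equal up to floating-point rounding
```

Without the division, the accumulated gradient would be accumulation_steps times larger, which acts like an unintended increase in the learning rate.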

3.6 Architecture Upgrade: Temporal Fusion Transformer (Future)

If LDTM is upgraded to a Transformer-based architecture (e.g. TFT — Lim et al. 2021), Flash Attention becomes applicable:

Flash Attention v2 (Dao 2023):
  Standard attention: O(n²) memory
  Flash Attention:    O(n) memory via IO-aware tiling

For sequence length n=30: n² = 900 vs n = 30 score entries per head, a marginal benefit
For sequence length n=252: n² = 63,504 vs n = 252, a 252× memory reduction

Recommendation: Flash Attention is not applicable to LSTM; it only applies to self-attention mechanisms. An LSTM has no attention matrix, and its activation memory already grows as O(n) in sequence length. If a transformer is adopted, use xformers.ops.memory_efficient_attention (available in NGC PyTorch 24.02+) or PyTorch's built-in torch.nn.functional.scaled_dot_product_attention.

3.7 FP8 Quantization for Inference

The GB10 supports FP8 (as demonstrated by the Mistral-7B FP8 deployment). For LDTM inference:

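# Hypothetical FP8 inference sketch (requires torchao or NVIDIA TensorRT); not runnable as written.
# PyTorch's built-in torch.ao.quantization.quantize_dynamic accepts only
# dtype=torch.qint8 or torch.float16, so the call below is illustrative.
from torch_ao.quantization import quantize_dynamic   # placeholder import path
model_fp8 = quantize_dynamic(model, {nn.LSTM, nn.Linear}, dtype=torch.float8_e4m3fn)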

Expected impact:

  • Model size: 900KB → ~225KB (4× reduction)
  • Inference VRAM: essentially unchanged at ~310 MiB (weights shrink by less than 1 MiB; CUDA context dominates anyway)
  • Speedup: ~1.5× on inference (forward pass is already ~2ms, reducing to ~1.3ms)
  • Practical benefit: negligible for LDTM (inference is already limited by DB I/O, not GPU compute)

FP8 quantization is meaningful only at batch inference scale (thousands of concurrent requests). For LDTM's single-ticker sequential inference, the DB query (~80ms) dominates latency.
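The end-to-end effect follows from the latency table in section 1.2: speeding up a step that takes 2 ms out of 142 ms barely moves the total. Here is a short sketch of that arithmetic, with step names chosen for illustration.

```python
# Per-ticker inference steps in milliseconds, from section 1.2.
STEPS_MS = {
    "checkpoint_load": 50.0,
    "db_query_and_features": 80.0,
    "gpu_forward": 2.0,
    "db_write": 10.0,
}

def latency_after_speedup(steps_ms, step, speedup):
    """Total latency if a single step becomes `speedup` times faster."""
    updated = dict(steps_ms)
    updated[step] = steps_ms[step] / speedup
    return sum(updated.values())

before = sum(STEPS_MS.values())
after = latency_after_speedup(STEPS_MS, "gpu_forward", 1.5)
overall_speedup = before / after
print(f"{before:.1f} ms -> {after:.1f} ms ({overall_speedup:.3f}x overall)")
```

By the same arithmetic, halving the DB query time would save about 40 ms per ticker, which is why the query is the more useful optimization target.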

3.8 Parallel Data Loading

Current configuration uses num_workers=0 (no multiprocessing in DataLoader):

train_loader = DataLoader(ds_train, batch_size=32, num_workers=0)

Recommendation:

train_loader = DataLoader(ds_train, batch_size=32, num_workers=2, pin_memory=True)

| num_workers | Training time (AAPL, 10 epochs) | Notes |
|---|---|---|
| 0 (current) | 31s | Single-threaded load |
| 2 | 27s | ~13% faster |
| 4 | 25s | Diminishing returns, more memory |

pin_memory=True places each batch in page-locked (pinned) host memory, which speeds up host-to-GPU copies. The benefit is marginal for small batches, but it costs nothing to enable.


4. Triton/LDTM Cohabitation Strategy

The GB10 runs both Triton (Mistral-7B FP8, 39GB) and LDTM training concurrently. This creates GPU scheduling contention at 90% GPU-Util.

4.1 Time-Based Scheduling

Train LDTM when Triton is least active:

# Training at 1 AM on weekends (Triton handles low traffic overnight)
# Inference at 6:15 PM (Triton traffic peak is 9 AM–5 PM; post-market inference avoids peak)
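The schedule above could be encoded as a small gate that a job runner checks before launching work. This is a sketch under stated assumptions: the function name is hypothetical, times are local server time, and inference is assumed to run on weekdays only because it is described as post-market.

```python
from datetime import datetime, time

TRITON_PEAK_START = time(9, 0)    # Triton traffic peak: 9 AM to 5 PM
TRITON_PEAK_END = time(17, 0)
TRAIN_START = time(1, 0)          # weekend training slot
INFER_START = time(18, 15)        # post-market inference slot

def job_for(now):
    """Return "train", "infer", or None for the job that may start at `now`."""
    is_weekend = now.weekday() >= 5          # Saturday is 5, Sunday is 6
    clock = now.time().replace(second=0, microsecond=0)
    if is_weekend and clock == TRAIN_START:
        return "train"
    if not is_weekend and clock == INFER_START:
        return "infer"
    return None

def in_triton_peak(clock):
    """True if a wall-clock time falls inside the Triton traffic peak."""
    return TRITON_PEAK_START <= clock < TRITON_PEAK_END

print(job_for(datetime(2026, 4, 25, 1, 0)))    # a Saturday at 1 AM
print(job_for(datetime(2026, 4, 21, 18, 15)))  # a Tuesday at 6:15 PM
```

In practice the same rule is often expressed as two cron entries; the function form is easier to unit-test.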

4.2 CUDA Streams (Advanced)

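For overlapping independent work inside a single LDTM process:

# In trainer.py:
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    model.train()
    # ... training loop

CUDA streams order and overlap work within one process. PyTorch issues kernels on the default stream; running the training loop on a separate stream lets independent work in the same process (for example data transfers and compute) overlap. Streams do not coordinate separate processes: each LDTM container and Triton holds its own CUDA context, and the driver time-slices between them unless MPS is enabled (section 4.3).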

4.3 CUDA MPS (Multi-Process Service)

For sustained concurrent training without context switching:

# Enable MPS on the host
nvidia-cuda-mps-control -d

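# Containers that can reach the MPS control pipe (typically by sharing the host
# IPC namespace and mounting the MPS pipe directory) then run as MPS clients
# instead of creating fully independent CUDA contexts (~250 MiB each)

With MPS:

  • LDTM containers submit work through a shared MPS server instead of being time-sliced against each other
  • VRAM savings: up to ~250 MiB of context overhead per concurrent container (~1 GB at 4 slots; the ~25 GB figure, 103 × 250 MiB, applies only if all 103 containers run at once)
  • Potential throughput improvement: 20–40% for workloads with many small models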

Warning: MPS requires all processes to run as the same Unix user. Test on a subset before enabling in production.


5. Recommended Configuration

For improved model quality (no Azure deployment needed):

| Parameter | Current | Recommended | Impact |
|---|---|---|---|
| hidden_size | 128 | 256 | Better tensor core utilization, ~5–10% lower val_loss |
| num_layers | 2 | 3 | Captures quarterly patterns |
| window_size | 30 | 60 | Better 1-month predictions |
| batch_size | 32 | 32 | No change |
| dropout | 0.2 | 0.3 | Compensates for larger model |
| num_workers | 0 | 2 | ~13% faster data loading |
| pin_memory | False | True | Marginal GPU transfer speedup |

Estimated improvement:

  • val_loss: 0.578 → ~0.50–0.52 (average across 103 tickers)
  • Training time all 103 (16 slots): ~8 min → ~12 min (acceptable tradeoff)
  • Inference time: unchanged (~2s total for all 103)

Breaking change: All existing checkpoints become invalid when hidden_size or num_layers changes. Run full retrain after config update.
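A guard against loading stale checkpoints can be as simple as comparing the architecture fields of the saved and current configs. The sketch below is illustrative: the LDTMConfig class, its field layout, and checkpoint_compatible are assumptions modeled on the config.py snippets above, not the project's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LDTMConfig:
    """Hypothetical config object; defaults mirror the 'Current' column."""
    hidden_size: int = 128
    num_layers: int = 2
    window_size: int = 30
    batch_size: int = 32
    dropout: float = 0.2
    num_workers: int = 0
    pin_memory: bool = False

# Fields that change weight tensor shapes and therefore break old checkpoints.
ARCHITECTURE_FIELDS = ("hidden_size", "num_layers")

def checkpoint_compatible(saved, current):
    """True if a checkpoint trained with `saved` can be loaded under `current`."""
    return all(getattr(saved, f) == getattr(current, f) for f in ARCHITECTURE_FIELDS)

current = LDTMConfig()
recommended = LDTMConfig(hidden_size=256, num_layers=3, window_size=60,
                         dropout=0.3, num_workers=2, pin_memory=True)

if not checkpoint_compatible(current, recommended):
    print("Architecture changed: retrain all tickers before running inference.")
```

Storing the config next to each checkpoint makes this check possible at load time, so a mismatched model fails early instead of during weight loading.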


6. Long-Term Architecture Roadmap

Phase 3a: Temporal Fusion Transformer (6 months)

Replace LSTM with TFT (Temporal Fusion Transformer, Lim et al. 2021):

  • Multi-head self-attention over 252-day windows (1 trading year)
  • Variable selection networks per feature
  • Flash Attention v2 for O(n) memory vs O(n²)
  • Expected val_loss: ~0.35–0.45
  • Training time: ~3× longer per ticker; parallelizable at 32+ slots when Triton is offline

Phase 3b: Ensemble (3 months)

Ensemble LDTM with the existing TQQQ Random Forest signal:

  • LDTM provides price direction and magnitude
  • TQQQ RF provides regime classification (BUY/HOLD/SHORT probability)
  • Weighted combination based on per-model 30-day accuracy
  • No retraining required; pure post-processing
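The accuracy-weighted combination might look like the following sketch. It assumes both signals have already been mapped to a common scale (here, -1 for short to +1 for buy) and that weights are simply proportional to each model's trailing 30-day accuracy. The function names and the scaling are illustrative choices, not a specification of the planned ensemble.

```python
def accuracy_weights(acc_ldtm, acc_rf):
    """Normalize two trailing accuracies into weights that sum to 1."""
    total = acc_ldtm + acc_rf
    if total <= 0:
        raise ValueError("at least one accuracy must be positive")
    return acc_ldtm / total, acc_rf / total

def combine(ldtm_signal, rf_signal, acc_ldtm, acc_rf):
    """Blend two signals on a common [-1, 1] scale using accuracy weights."""
    w_ldtm, w_rf = accuracy_weights(acc_ldtm, acc_rf)
    return w_ldtm * ldtm_signal + w_rf * rf_signal

# Made-up inputs: LDTM mildly bullish, RF bearish, LDTM recently more accurate.
blended = combine(ldtm_signal=0.5, rf_signal=-1.0, acc_ldtm=0.6, acc_rf=0.4)
print(round(blended, 3))
```

Because this is pure post-processing, the weights can be recomputed daily from the run log without touching either model.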

Phase 3c: Exogenous Features (3 months)

Incorporate IB-provided data already in the DB:

  • VXN (NASDAQ volatility index, already ingested via ingest_vxn.py)
  • News sentiment score (headlines already ingested; need sentiment model)
  • Earnings date proximity (IB earnings calendar API)

These features require a fundamentally different model input: categorical variables (days to earnings) alongside continuous time series. Temporal Fusion Transformer handles this naturally via the variable selection network.
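As an illustration of the earnings-proximity feature, the sketch below computes calendar days until the next earnings date from a list of known dates. The function name and the example dates are hypothetical; retrieving the dates from the IB earnings calendar is not shown.

```python
from bisect import bisect_left
from datetime import date

def days_to_next_earnings(today, earnings_dates):
    """Calendar days from `today` to the next earnings date, or None if none remain.

    Returns 0 when `today` is itself an earnings date.
    """
    upcoming = sorted(earnings_dates)
    idx = bisect_left(upcoming, today)   # first date on or after today
    if idx == len(upcoming):
        return None
    return (upcoming[idx] - today).days

# Made-up earnings calendar for a single ticker.
calendar = [date(2026, 1, 29), date(2026, 4, 30), date(2026, 7, 30)]
print(days_to_next_earnings(date(2026, 4, 21), calendar))
```

A value like this is known in advance for every future day in the forecast horizon, which is the kind of input TFT treats separately from observed time series.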