Project: LDTM — Long-Term Deep Temporal Model

One-Line Version

Built a GPU-accelerated ML system that trains 103 separate neural networks and holds every one of them accountable: every prediction is written to the database before its outcome is known, every actual price is filled back automatically when it arrives, and every model must earn its active status through a promotion gate before it touches production.

The Situation

The fundamental failure mode of most ML prediction systems isn't the model — it's the measurement. A model gets trained, a backtest gets run, the chart looks reasonable, and the work is declared done. There's no ongoing comparison between what was predicted and what actually happened. No way to know whether the model is improving, degrading, or was never credible. No accountability loop.

The goal here was to build a system where accountability was the architecture, not an afterthought. Every prediction would be on record, written by the evening run after the close, before the price it forecasts exists. When the actual price arrived later, the system would find that prediction, fill in the result, compute the error, and update a live accuracy leaderboard: automatically, every day, unattended. No manual evaluation. No cherry-picked results. Every model in the system earns its keep or gets retired.

The harder infrastructure challenge: 103 tickers, each with 20+ years of daily history, each requiring its own model trained from scratch, all running within a single evening cron window — while the same GPU serves a live LLM inference process running in parallel.

What I Built — Technical

Architecture: Deliberate Choices, Not Defaults

LSTM over Transformer, and why. Daily OHLCV sequences are autoregressive: each bar depends on the ones before it, and the LSTM hidden state carries a rolling summary of the recent regime at O(n) cost. A Transformer would need explicit positional encoding plus O(n²) attention. At a 30-day window that quadratic term is small but buys nothing the LSTM's gating does not already provide, and widening attention toward the 5,000+ daily bars in 20 years of history would make the memory overhead material with no return. FlashAttention's IO-aware tiling only pays off on long sequences, so it would not help at this length. The LSTM gating provides implicit attention at O(n) already. The choice was made and documented before training started, not rationalized afterward.

Per-ticker models over a shared model — and why. KLAC trades around $1,800; CPRT around $34. NVDA and SQQQ have inverted volatility regimes by construction. A shared model learns median NDX-100 behavior, suppressing the outlier momentum patterns that are most useful at the individual ticker level. 103 models, each expert in one ticker, outperform one model trying to be mediocre about all of them.

Architecture per model: 2-layer LSTM, hidden size 128, input size 11 (OHLCV + 6 engineered features), 30-day lookback window. Three independent prediction heads — next day, next Monday, one month — each a separate 128→64→1 tower. The heads are not shared because each horizon has a different optimal feature weighting: next-day is dominated by recent momentum and RSI; one-month is driven by moving average regime. Forcing a shared head to compromise across horizons loses resolution on each. Total: ~227K parameters per model. ~23.4M across the 103-ticker universe.
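The parameter total can be sanity-checked with plain arithmetic. The sketch below assumes PyTorch's nn.LSTM layout (four gates, two bias vectors per layer) and biased linear layers in each head; neither detail is stated above, so the function names and the exact count are illustrative. Under those assumptions the result is roughly 229K per model, within about 1% of the ~227K quoted.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one LSTM layer: 4 gates, input and recurrent weights, two bias vectors."""
    weights = 4 * hidden_size * (input_size + hidden_size)
    biases = 2 * 4 * hidden_size  # assumption: separate input and recurrent biases
    return weights + biases


def head_params(hidden_size: int = 128, mid: int = 64) -> int:
    """One 128 -> 64 -> 1 tower: two linear layers, each with a bias."""
    return (hidden_size * mid + mid) + (mid * 1 + 1)


INPUT_SIZE, HIDDEN, N_HEADS, N_TICKERS = 11, 128, 3, 103

# Layer 1 reads the 11 features; layer 2 reads layer 1's hidden state.
lstm_total = lstm_layer_params(INPUT_SIZE, HIDDEN) + lstm_layer_params(HIDDEN, HIDDEN)
per_model = lstm_total + N_HEADS * head_params()
universe = per_model * N_TICKERS

print(f"per model: {per_model:,}  universe: {universe:,}")
```

Almost 90% of the parameters sit in the two recurrent layers, so the three separate heads add little to the per-model footprint.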

Feature Engineering and Normalization

The 11-feature input adds log returns at 1-day and 5-day horizons, RSI-14, and three moving average ratios (MA-5, MA-10, MA-20) to raw OHLCV.
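A minimal sketch of how the six engineered features might be computed from a list of closes. The exact definitions are assumptions: RSI here uses simple averages of gains and losses (Wilder smoothing is a common alternative), and each moving average ratio is taken as close divided by the moving average. All function names are hypothetical.

```python
import math


def log_return(closes, i, lag):
    """Log return from bar i-lag to bar i."""
    return math.log(closes[i] / closes[i - lag])


def rsi(closes, i, period=14):
    """RSI over the last `period` price changes ending at bar i (simple averages)."""
    gains = losses = 0.0
    for j in range(i - period + 1, i + 1):
        change = closes[j] - closes[j - 1]
        if change > 0:
            gains += change
        else:
            losses -= change
    if gains == 0 and losses == 0:
        return 50.0  # flat window: treat as neutral (assumption)
    if losses == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gains / losses)


def ma_ratio(closes, i, n):
    """Close at bar i divided by the n-bar simple moving average ending at bar i."""
    window = closes[i - n + 1 : i + 1]
    return closes[i] / (sum(window) / n)


def feature_row(bar, closes, i):
    """Raw OHLCV plus the six engineered features: 11 values in total."""
    o, h, l, c, v = bar
    return [
        o, h, l, c, v,
        log_return(closes, i, 1), log_return(closes, i, 5),
        rsi(closes, i),
        ma_ratio(closes, i, 5), ma_ratio(closes, i, 10), ma_ratio(closes, i, 20),
    ]


# Made-up steadily rising closes; bar 24 has enough history for MA-20 and RSI-14.
closes = [100.0 + k for k in range(25)]
row = feature_row((123.5, 124.5, 123.0, 124.0, 1_000_000), closes, 24)
print(len(row), row[5:])
```

Index i must be at least 19 so that the 20-bar moving average has a full window.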

The critical design decision is per-window min-max normalization rather than a global scaler. A global scaler fitted on AAPL training data from 2005 ($5/share) would produce out-of-range normalized values at 2026 inference prices ($270/share), causing cascading mispredictions. Per-window normalization rescales each 30-day window independently so the model learns relative movements within the window, not absolute price levels.
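To illustrate the idea, here is a minimal sketch of per-window min-max scaling. It assumes each feature column is scaled independently within the window, which the text does not spell out, and the function names are hypothetical. Two windows with the same relative shape at very different price levels normalize to identical values.

```python
def normalize_window(window):
    """Min-max scale each feature column of one window to [0, 1].

    Returns the scaled rows plus per-column lows and highs, which are
    needed later to map a normalized prediction back to a price.
    """
    n_features = len(window[0])
    lows = [min(row[k] for row in window) for k in range(n_features)]
    highs = [max(row[k] for row in window) for k in range(n_features)]
    scaled = [
        [
            (row[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
            for k in range(n_features)
        ]
        for row in window
    ]
    return scaled, lows, highs


def denormalize(value, low, high):
    """Map a normalized value back to the window's original scale."""
    return value * (high - low) + low


# Same relative move (+10%, +20%) at two very different price levels.
early = [[5.0], [5.5], [6.0]]
late = [[270.0], [297.0], [324.0]]

early_scaled, _, _ = normalize_window(early)
late_scaled, late_lows, late_highs = normalize_window(late)

print(early_scaled == late_scaled)
print(denormalize(late_scaled[1][0], late_lows[0], late_highs[0]))
```

A constant column would divide by zero, so the sketch maps it to 0.0; how the real pipeline handles that case is not stated.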

The 30-day window was chosen as approximately one earnings cycle and one options expiry — enough context for MA-20 and RSI convergence, short enough that the earliest prices in the window are unlikely to belong to a different macro regime.

Training uses AdamW with CosineAnnealingLR, FP16 Automatic Mixed Precision, and early stopping (patience=10). Loss is multi-head MSE in normalized space — quadratic penalty is deliberate, since a 10% prediction error is materially worse than ten 1% errors.
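The quadratic-penalty argument can be checked with a few lines of arithmetic. This is an illustrative comparison on made-up error vectors, not the training loss itself: both sets carry the same total absolute error, but MSE scores the single large miss ten times worse.

```python
def mse(errors):
    """Mean squared error of a list of prediction errors."""
    return sum(e * e for e in errors) / len(errors)


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


one_big = [0.10] + [0.0] * 9   # one 10% miss, nine exact predictions
ten_small = [0.01] * 10        # ten 1% misses

mae_gap = abs(mae(one_big) - mae(ten_small))   # same total absolute error
mse_ratio = mse(one_big) / mse(ten_small)      # quadratic loss separates them

print(f"MAE gap: {mae_gap:.2e}  MSE ratio: {mse_ratio:.1f}")
```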

GPU-Aware Parallel Orchestrator

Built a custom orchestrator that queries nvidia-smi at startup, detects available GPUs and VRAM, computes concurrent slot counts (VRAM / 600 MiB per job), and fills a thread-safe token queue. Worker threads pull a GPU token, launch a Docker container scoped to that device, and return the token on completion.
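The slot calculation might look like the sketch below. It parses text in the format produced by `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits`; whether the orchestrator reads free or total memory is an assumption, as are the function names and the optional `max_slots_per_gpu` cap. The sample input is made up so the example runs without a GPU.

```python
import queue

MIB_PER_JOB = 600  # per-job VRAM budget from the text


def parse_gpu_memory(csv_text):
    """Parse 'index, free_mib' lines into a {gpu_index: free_mib} dict."""
    gpus = {}
    for line in csv_text.strip().splitlines():
        index, free_mib = (field.strip() for field in line.split(","))
        gpus[int(index)] = int(free_mib)
    return gpus


def build_token_queue(gpus, mib_per_job=MIB_PER_JOB, max_slots_per_gpu=None):
    """Put one token per concurrent slot into a thread-safe queue.

    Each token is the GPU index a worker should pin its container to.
    """
    tokens = queue.Queue()
    for index, free_mib in sorted(gpus.items()):
        slots = free_mib // mib_per_job
        if max_slots_per_gpu is not None:
            slots = min(slots, max_slots_per_gpu)  # hypothetical headroom cap
        for _ in range(slots):
            tokens.put(index)
    return tokens


# Made-up sample output for two GPUs.
sample = "0, 2400\n1, 1300\n"
gpus = parse_gpu_memory(sample)
tokens = build_token_queue(gpus)
capped = build_token_queue(gpus, max_slots_per_gpu=1)

print(gpus, tokens.qsize(), capped.qsize())
```

Because queue.Queue blocks on get(), a worker that finds no free token simply waits until another job returns one.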

Two ordered waves: all training jobs first, then all inference jobs — so checkpoints exist before inference starts. After both waves, the orchestrator prints a structured summary table: ticker, GPU index, duration, status, val_loss.
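A minimal sketch of the two-wave pattern using ThreadPoolExecutor and a token queue. The `run_job` callable stands in for launching a Docker container pinned to one GPU; it and the other names here are hypothetical, and the sleep only simulates work.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor


def run_wave(kind, tickers, tokens, run_job):
    """Run one job per ticker, each holding a GPU token while it runs."""

    def worker(ticker):
        gpu = tokens.get()  # blocks until a slot is free
        start = time.monotonic()
        try:
            status = run_job(kind, ticker, gpu)
        except Exception as exc:
            status = f"failed: {exc}"
        finally:
            tokens.put(gpu)  # always return the slot, even on failure
        return {
            "ticker": ticker,
            "kind": kind,
            "gpu": gpu,
            "seconds": time.monotonic() - start,
            "status": status,
        }

    with ThreadPoolExecutor(max_workers=max(1, tokens.qsize())) as pool:
        return list(pool.map(worker, tickers))


def fake_job(kind, ticker, gpu):
    """Stand-in for a containerized training or inference job."""
    time.sleep(0.01)
    return "ok"


tokens = queue.Queue()
for gpu_index in (0, 0, 0, 0):  # four slots on one GPU
    tokens.put(gpu_index)

tickers = ["AAPL", "NVDA", "KLAC", "CPRT", "SQQQ"]

# Wave 1 finishes completely before wave 2 starts, so checkpoints exist.
results = run_wave("train", tickers, tokens, fake_job)
results += run_wave("infer", tickers, tokens, fake_job)

for r in results:
    print(f"{r['ticker']:<6}{r['kind']:<7}gpu={r['gpu']}  {r['seconds']:.3f}s  {r['status']}")
```

The barrier between waves comes for free: run_wave returns only after the executor has drained every job in that wave.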

On the NVIDIA GB10 (128GB unified memory), the orchestrator runs 4 parallel slots while coexisting with a 39GB Mistral-7B FP8 inference process on the same GPU. This wasn't luck — the slot count and the cron schedule (inference runs at 18:15, after Triton's daytime peak) were calculated to leave headroom. On a shared GPU, being a good tenant is an engineering requirement, not a courtesy.

Observed performance:

Full 103-ticker retrain: ~20 minutes wall clock
Average training time per model: ~52 seconds
GPU utilization during training: ~90%
VRAM per container: ~310 MiB (~250 MiB CUDA context, ~0.87 MiB model weights)
Inference across all 103 tickers: ~2 seconds wall clock
Per-ticker inference: 50ms checkpoint load · 80ms DB query and feature computation · 2ms GPU forward pass · 10ms DB write
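As a rough consistency check on these figures, the arithmetic below assumes the 4 parallel slots described above and idealized lockstep scheduling (every job takes the average time). It is a back-of-envelope estimate, not a measurement.

```python
import math

N_TICKERS = 103
TRAIN_SECONDS = 52   # average per model, from the list above
SLOTS = 4            # parallel slots on the GB10

sequential_min = N_TICKERS * TRAIN_SECONDS / 60
rounds = math.ceil(N_TICKERS / SLOTS)          # batches of 4 jobs
parallel_min = rounds * TRAIN_SECONDS / 60

# Per-ticker inference breakdown in milliseconds, from the list above.
steps_ms = {"checkpoint load": 50, "db query + features": 80, "gpu forward": 2, "db write": 10}
per_ticker_ms = sum(steps_ms.values())
gpu_share = steps_ms["gpu forward"] / per_ticker_ms

print(f"sequential: {sequential_min:.1f} min, 4 slots: {parallel_min:.1f} min")
print(f"inference per ticker: {per_ticker_ms} ms, GPU share: {gpu_share:.1%}")
```

The estimate lands at about 22.5 minutes for training, in the same range as the ~20 minutes observed. On the inference side the GPU forward pass is under 2% of per-ticker latency, so checkpoint loading and database I/O dominate.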

Snapshot/Fillback — The Audit Architecture

This is the design decision that makes the system distinct.

Prediction and evaluation are separated by design. Every evening after market close, inference writes one snapshot row per ticker:

ticker | run_date | run_date_close | next_day_pred | next_monday_pred | one_month_pred | direction_pred

The fillback process runs in the same cron window and scans for any snapshot rows where actuals are still NULL and the target date has passed. It does a single batch SQL pull of all needed future prices, then fills back per row:

next_day_actual | next_day_pct_error | next_day_direction_actual | next_day_direction_correct
next_monday_actual | next_monday_pct_error
one_month_actual | one_month_pct_error

This creates a permanent audit record in which each prediction is committed before its outcome exists. Every prediction the system ever made is on record with the date it was made, the price it predicted, and the actual price that materialized. A 30-day rolling accuracy view (ldtm_accuracy_30d) aggregates this into a live leaderboard by ticker. A dry-run mode shows pending fills without writing.
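The next-day fillback step can be sketched as follows. This uses an in-memory SQLite database as a stand-in for PostgreSQL so the example runs anywhere. The table names (`ldtm_snapshot`, `daily_close`), the percent-error formula, the 'up'/'down' direction encoding, and the rule that the target is the first close after run_date are all assumptions; the snapshot column names follow the lists above. The ticker and prices are made up.

```python
import sqlite3


def fillback_next_day(conn, dry_run=False):
    """Fill next-day actuals for snapshot rows whose target close has arrived."""
    pending = conn.execute(
        "SELECT ticker, run_date, run_date_close, next_day_pred, direction_pred "
        "FROM ldtm_snapshot WHERE next_day_actual IS NULL"
    ).fetchall()

    # One batch pull of closes, grouped by ticker in date order.
    closes = {}
    for ticker, day, close in conn.execute(
        "SELECT ticker, day, close FROM daily_close ORDER BY ticker, day"
    ):
        closes.setdefault(ticker, []).append((day, close))

    filled = []
    for ticker, run_date, base, pred, direction_pred in pending:
        # ISO date strings sort chronologically, so string comparison works.
        later = [close for day, close in closes.get(ticker, []) if day > run_date]
        if not later:
            continue  # target session has not happened yet; leave the NULLs
        actual = later[0]
        pct_error = (pred - actual) / actual * 100.0
        direction_actual = "up" if actual > base else "down"
        correct = int(direction_actual == direction_pred)
        filled.append((ticker, run_date, actual, pct_error, direction_actual, correct))
        if not dry_run:
            conn.execute(
                "UPDATE ldtm_snapshot SET next_day_actual = ?, next_day_pct_error = ?, "
                "next_day_direction_actual = ?, next_day_direction_correct = ? "
                "WHERE ticker = ? AND run_date = ?",
                (actual, pct_error, direction_actual, correct, ticker, run_date),
            )
    if not dry_run:
        conn.commit()
    return filled


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ldtm_snapshot (ticker TEXT, run_date TEXT, run_date_close REAL, "
    "next_day_pred REAL, direction_pred TEXT, next_day_actual REAL, "
    "next_day_pct_error REAL, next_day_direction_actual TEXT, "
    "next_day_direction_correct INTEGER)"
)
conn.execute("CREATE TABLE daily_close (ticker TEXT, day TEXT, close REAL)")
conn.execute(
    "INSERT INTO ldtm_snapshot "
    "(ticker, run_date, run_date_close, next_day_pred, direction_pred) "
    "VALUES ('XYZ', '2026-01-05', 100.0, 102.0, 'up')"
)
conn.executemany(
    "INSERT INTO daily_close VALUES (?, ?, ?)",
    [("XYZ", "2026-01-05", 100.0), ("XYZ", "2026-01-06", 104.0)],
)

preview = fillback_next_day(conn, dry_run=True)  # reports the pending fill only
applied = fillback_next_day(conn)                # writes the actuals
remaining = fillback_next_day(conn)              # nothing left to fill
row = conn.execute(
    "SELECT next_day_actual, next_day_direction_correct FROM ldtm_snapshot"
).fetchone()
print(preview, row, remaining)
```

Because the scan keys on NULL actuals rather than on a fixed date, a missed cron run is caught up automatically on the next one.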

A prediction system without an accuracy audit is not a system — it is a one-time experiment. This architecture is the difference.

Model Registry with Promotion Gate

Formal model lifecycle: candidate → active → retired.

Before promotion to active status, a model must pass a multi-fold backtest gate that verifies:

Direction accuracy ≥ 52% on next-day horizon
Average absolute price error ≤ 5%
Zero leakage rows (test dates strictly after train cutoff)
Minimum 2 folds completed

Gate passes: checkpoint copied to a versioned filename ({ticker}_ldtm_v{major}.{trained_on_date}.pt), previous active model retired, new model promoted. Gate fails: model stays as candidate, notification fires with the specific failure reason. Version major numbers increment when architecture parameters change — making it immediately visible which models were retrained after a config change.
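A minimal sketch of the gate check and the versioned filename. The thresholds and filename pattern come from the text; the `BacktestResult` structure, the function names, and the ISO date format in the filename are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BacktestResult:
    direction_accuracy: float  # next-day horizon, as a fraction
    avg_abs_pct_error: float   # average absolute price error, in percent
    leakage_rows: int          # test rows dated on or before the train cutoff
    folds_completed: int


def gate_failures(result):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if result.direction_accuracy < 0.52:
        failures.append(f"direction accuracy {result.direction_accuracy:.1%} < 52%")
    if result.avg_abs_pct_error > 5.0:
        failures.append(f"avg abs error {result.avg_abs_pct_error:.2f}% > 5%")
    if result.leakage_rows != 0:
        failures.append(f"{result.leakage_rows} leakage rows")
    if result.folds_completed < 2:
        failures.append(f"only {result.folds_completed} fold(s) completed")
    return failures


def checkpoint_name(ticker, major, trained_on):
    """Build {ticker}_ldtm_v{major}.{trained_on_date}.pt from its parts."""
    return f"{ticker}_ldtm_v{major}.{trained_on}.pt"


good = BacktestResult(0.53, 3.2, 0, 3)
bad = BacktestResult(0.49, 6.1, 2, 1)

print(gate_failures(good))
print(gate_failures(bad))
print(checkpoint_name("NVDA", 2, "2026-04-22"))
```

Returning the failed checks instead of a bare boolean gives the notification its specific failure reason.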

ONNX Export and TensorRT Path

ONNX export at opset 17 with dynamic batch dimension, plus a TensorRT engine build script. Validated with onnx.checker before the TRT build runs. This preserves a path to sub-millisecond inference if the system scales to a higher-frequency use case, without requiring a model rewrite.

What I Built — Leadership and Judgment

The system was designed to be measurable from the start, not as a feature added later. The snapshot/fillback architecture and the promotion gate are not things I added after the model was working — they are the frame the rest of the system was built around. A prediction system that doesn't measure itself against reality is an experiment dressed as a product. The governance controls here — audit records, leaderboards, promotion gates, versioned checkpoints — are the same controls you'd implement in a production ML system at a financial institution. Scale doesn't change whether the discipline is right.

Architecture decisions were made explicit and documented before training started. LSTM over Transformer, per-ticker over shared, per-window over global normalization — each decision is in the architecture file with specific reasoning, not retrofitted justification. This was written to guide future sessions and collaborators so the same questions don't have to be re-derived from first principles. Undocumented decisions become liabilities when teams or tooling change. That lesson comes from managing systems at scale where those liabilities showed up as outages.

Resource governance was designed in, not bolted on. The VRAM-aware slot calculation, the two-wave execution order, the cron scheduling around the LLM inference peak — these are decisions that reflect experience running shared infrastructure where being wrong costs someone else's availability. On a shared GPU, headroom is a deliverable.

Results

Metric	Value
Tickers in production	103 (NDX-100 + ETFs)
Training data per ticker	20+ years of daily OHLCV
Total parameters across universe	~23.4M (103 × ~227K)
Full 103-ticker retrain time	~20 minutes wall clock
Inference time (all 103 tickers)	~2 seconds wall clock
GPU utilization during training	~90% on NVIDIA GB10
VRAM per container	~310 MiB
Next-day directional accuracy (validation)	~52%
Models with full lifecycle governance	candidate → active → retired
Daily cron runs, unattended	Yes — evenings, weekdays
First live signal cohort (2026-04-22)	68 bullish / 26 neutral / 9 bearish across 103 tickers

The ~52% next-day directional accuracy is marginally above chance. The more important number is that the system measures it at all: any degradation shows up in the accuracy leaderboard instead of leaving a bad model in rotation undetected.

What This Shows About How I Work

I design systems with feedback loops built in from the start. The snapshot/fillback pattern and the promotion gate came first, and the models were built to fit inside them. The explicit architecture tradeoffs (LSTM vs Transformer, per-ticker vs shared, per-window vs global normalization) reflect an approach where the reasoning is as much of the deliverable as the code. Decisions that can't be explained can't be safely revisited, and systems that outlast their original author need to carry their own rationale.

Technologies

Python 3.11 · PyTorch (LSTM, FP16 AMP, AdamW, CosineAnnealingLR, EarlyStopping) · NVIDIA GB10 Blackwell (128GB unified memory) · CUDA · NGC PyTorch container · Docker · Docker Compose · ThreadPoolExecutor (parallel GPU orchestrator) · nvidia-smi (runtime VRAM detection) · PostgreSQL 15 · psycopg2 · SQLAlchemy · Interactive Brokers TWS / ib_insync · ONNX (opset 17, dynamic batch) · TensorRT · Streamlit · Mistral-7B FP8 via Triton · Cron

Relevant For

Role	Why This Story Fits
Staff / Principal ML Engineer	End-to-end ML system design with audit architecture, model registry lifecycle, and documented architectural tradeoffs
AI / ML Infrastructure Engineer	GPU orchestration, VRAM-aware parallelism, NGC containers, ONNX/TRT export, LLM cohabitation on shared GPU
Head of AI / ML Platform	Snapshot/fillback as a governance pattern — permanent prediction audit, promotion gates, accuracy leaderboards
Quant / Algo Research	Multi-horizon regression, per-window normalization, per-ticker model design, daily directional accuracy tracking
Technical Founding Role	Production-grade discipline at research scale — versioned models, leakage detection, documented decisions, unattended daily pipeline