LDTM v2 — Operations Runbook
Model: Long-Duration Temporal Model (LDTM) v2
Date: 2026-04-21
System: NVIDIA GB10 DGX, Ubuntu, Docker, PostgreSQL 15
Quick Reference
| Task | Command |
|---|---|
| Run inference — single ticker | bash model/ldtm/run_ldtm.sh --ticker AAPL --mode infer |
| Run inference — all tickers | bash model/ldtm/run_ldtm.sh --all --mode infer --parallel 16 |
| Full retrain — all tickers | bash model/ldtm/run_ldtm.sh --all --mode train --parallel 16 |
| Initialize DB tables | bash model/ldtm/run_ldtm.sh --init-db |
| Apply snapshot schema | See §1.3 |
| Start dashboard | docker compose --profile dashboard up -d dashboard |
| LLM query | docker compose --profile llm run --rm llm-query --group mega_cap --question "..." |
| Install cron jobs | bash schedule/install_cron.sh |
Prerequisites
1. System Requirements
| Component | Requirement |
|---|---|
| OS | Ubuntu 22.04+ |
| Docker | 24.x+ with NVIDIA Container Toolkit |
| GPU | NVIDIA GB10 (or any CUDA 12.x GPU) |
| RAM | 32 GB+ host RAM |
| Disk | 50 GB+ for model weights + Docker images |
| PostgreSQL | Running as trading-postgres container |
| IB Gateway | Running at TWS_HOST:TWS_PORT (for data ingestion) |
2. Verify Prerequisites
# Docker running
docker info | grep "Server Version"
# GPU accessible
nvidia-smi
# PostgreSQL healthy
docker exec trading-postgres pg_isready -U postgres -d trading
# .env file present
ls -la .env
# Required env vars
grep -E "DB_HOST|DB_NAME|DB_USER|DB_PASSWORD" .env
3. Required .env Keys
DB_HOST=localhost
DB_PORT=5432
DB_NAME=trading
DB_USER=postgres
DB_PASSWORD=<your_password>
TWS_HOST=<ib_gateway_host>
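# TWS defaults: 7497=paper, 7496=live. IB Gateway defaults: 4002=paper, 4001=live.
# Comments go on their own line: docker run --env-file does not strip inline comments.
TWS_PORT=7497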
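The grep in step 2 prints whichever keys it finds, so a missing key is easy to overlook. Below is a minimal sketch of a stricter preflight check, assuming the four DB_* keys that Appendix A marks as required. The helper names check_env_keys and inline_comment_lines are hypothetical and not part of the repository. The second helper is there because docker run --env-file passes everything after the = through as the value, so a trailing comment would become part of it.

```shell
#!/usr/bin/env bash

# check_env_keys FILE
# Prints "missing: KEY" for each required key that is absent or empty in FILE.
# Returns 0 when every key is set, 1 otherwise.
check_env_keys() {
  local env_file="$1" key status=0
  for key in DB_HOST DB_NAME DB_USER DB_PASSWORD; do
    # A key counts as present only if a line starts with KEY= and has a value.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "missing: ${key}"
      status=1
    fi
  done
  return "$status"
}

# inline_comment_lines FILE
# Prints KEY=VALUE lines that contain whitespace followed by "#",
# i.e. candidates for an inline comment that would leak into the value.
inline_comment_lines() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=.*[[:space:]]#' "$1" || true
}
```

Typical use would be check_env_keys .env && echo ".env OK" before the first run. A password that legitimately contains " #" would also be listed by inline_comment_lines, so treat its output as lines to review rather than as errors.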
Part 1: First-Time Setup
1.1 Build the LDTM Docker Image
cd /home/aimikamirai/projects/dgx-trading-system
docker build -t model-ldtm ./model/ldtm
Expected: ~5-10 minutes (downloads NGC PyTorch base image ~22GB on first run).
Verify:
docker images model-ldtm
# model-ldtm latest <sha> <date> 22.1GB
1.2 Build Supporting Images
docker build -t trading-dashboard ./dashboard
docker build -t ldtm-llm-query ./llm
Both are fast (~2 minutes, python:3.11-slim base).
1.3 Initialize Database Tables
# Create ldtm_run_log table
bash model/ldtm/run_ldtm.sh --init-db
# Create ldtm_daily_snapshots table + accuracy view
docker exec -i trading-postgres psql -U postgres -d trading \
< model/ldtm/snapshots_schema.sql
Verify:
docker exec trading-postgres psql -U postgres -d trading \
-c "\dt ldtm_*"
# Should list: ldtm_run_log, ldtm_daily_snapshots
docker exec trading-postgres psql -U postgres -d trading \
-c "\dv ldtm_*"
# Should list: ldtm_accuracy_30d
1.4 Install Cron Jobs
bash schedule/install_cron.sh
Verify:
crontab -l | grep ldtm
# Should show 4 entries: infer, blob_export, canary_retrain, monthly_retrain
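Counting grep matches by eye is error-prone, so the four expected entries can be checked by name instead. This sketch assumes each cron entry calls one of the schedule/ scripts listed in Appendix B. The contents of install_cron.sh are not shown here, so the script-name matching is an assumption, and missing_cron_jobs is a hypothetical helper.

```shell
#!/usr/bin/env bash

# missing_cron_jobs
# Reads crontab text on stdin and prints every expected script name
# that no active (non-commented) entry references.
missing_cron_jobs() {
  local entries script
  # Ignore commented-out lines so a disabled job counts as missing.
  entries="$(grep -v '^[[:space:]]*#' || true)"
  for script in run_ldtm_infer.sh run_blob_export.sh \
                run_ldtm_canary_retrain.sh run_ldtm_monthly_retrain.sh; do
    grep -Fq "$script" <<<"$entries" || echo "$script"
  done
}
```

Run it as crontab -l | missing_cron_jobs. Empty output means all four jobs are installed.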
Part 2: Running the Model
2.1 Single Ticker — Train + Infer
# Train only
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode train
# Infer only (requires existing checkpoint)
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode infer
# Both (train then infer)
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode both
Expected output:
[run_ldtm] mode=both tickers=1 parallel=1 epochs=100
[run_ldtm] ── Ticker 1/1: AAPL ──
[LDTM] Training AAPL window=30 hidden=128 layers=2
[LDTM] Device: cuda AMP: True
[LDTM] Dataset loaded in 0.8s train=3358 val=720
[LDTM] epoch 1/100 train_loss=0.542318 val_loss=0.621045 lr=1.00e-03
...
[LDTM] Early stop — best val_loss=0.577952 at epoch 10
[LDTM] Checkpoint → /model_weights/ldtm/AAPL_ldtm.pt
{"ticker": "AAPL", "next_day_close": 268.55, ...}
2.2 Named Group
bash model/ldtm/run_ldtm.sh --group mega_cap --mode both --parallel 4
Groups available: mega_cap semis software internet consumer healthcare industrial telecom financials hardware intl etfs
2.3 All 103 Tickers
# Full train + infer (first run or monthly retrain)
bash model/ldtm/run_ldtm.sh --all --mode both --parallel 16
# Daily inference only (after checkpoints exist)
bash model/ldtm/run_ldtm.sh --all --mode infer --parallel 16
Expected timing on GB10:
- Training all 103 (4 slots, Triton competing): ~20 minutes wall clock
- Inference all 103 (4 slots, Triton competing): ~3 minutes wall clock
- Training all 103 (16 slots, no Triton): ~8 minutes wall clock
2.4 Snapshot Pipeline (after inference)
# Write today's predictions to ldtm_daily_snapshots
docker run --rm --network host --env-file .env \
-v "$(pwd)/model/ldtm:/app" -w /app \
python:3.11-slim \
sh -c "pip install psycopg2-binary sqlalchemy -q && python snapshot_writer.py"
# Fill actuals for past predictions
docker run --rm --network host --env-file .env \
-v "$(pwd)/model/ldtm:/app" -w /app \
python:3.11-slim \
sh -c "pip install psycopg2-binary sqlalchemy pandas -q && python snapshot_fillback.py"
Or, using Docker Compose (after running docker compose build):
docker compose --profile ldtm-tools run --rm ldtm-snapshot-writer
docker compose --profile ldtm-tools run --rm ldtm-snapshot-fillback
2.5 Backfill Historical Snapshots
If running for the first time with existing ldtm_run_log data:
docker run --rm --network host --env-file .env \
-v "$(pwd)/model/ldtm:/app" -w /app \
python:3.11-slim \
sh -c "pip install psycopg2-binary sqlalchemy -q && python snapshot_writer.py --backfill"
Part 3: Dashboard
3.1 Start Dashboard (Local)
docker compose --profile dashboard up -d dashboard
Access at: http://localhost:8501
3.2 Stop Dashboard
docker compose --profile dashboard down
3.3 Check Dashboard Logs
docker logs trading-dashboard --tail 50
3.4 Rebuild Dashboard (after code changes)
docker compose --profile dashboard build dashboard
docker compose --profile dashboard up -d dashboard
Part 4: LLM Query Interface
4.1 Basic Query
docker compose --profile llm run --rm llm-query \
--question "What is the overall market signal today?"
4.2 Ticker-Specific Query
docker compose --profile llm run --rm llm-query \
--ticker NVDA \
--question "What does the model predict for NVDA this week and what is the recent accuracy?"
4.3 Group Query
docker compose --profile llm run --rm llm-query \
--group semis \
--question "Which semiconductor stocks have the strongest momentum?"
4.4 Custom Ticker List
docker compose --profile llm run --rm llm-query \
--tickers "NVDA AMD AMAT MU MRVL" \
--question "Rank these by 1-month implied upside"
4.5 All Tickers Market Summary
docker compose --profile llm run --rm llm-query \
--all \
--question "Summarize the market in 5 bullet points based on today's predictions"
Note: --all passes all 103 tickers as context. This uses ~15K tokens and is near the Mistral-7B context limit. Use groups for faster, cleaner responses.
Part 5: Querying the Database
5.1 Today's Predictions (all tickers, ranked by 1-month return)
SELECT s.ticker, s.run_date, s.run_date_close,
s.next_day_close_pred,
s.one_month_close_pred,
ROUND((((s.one_month_close_pred / NULLIF(s.next_day_close_pred,0)) - 1) * 100)::numeric, 2)
AS implied_1m_return_pct
FROM ldtm_daily_snapshots s
WHERE s.run_date = (SELECT MAX(run_date) FROM ldtm_daily_snapshots)
ORDER BY implied_1m_return_pct DESC NULLS LAST;
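Note that the implied return in this query is measured against the next-day prediction, not against run_date_close. The same arithmetic as a small shell function, for spot-checking a single row by hand. implied_return_pct is a hypothetical helper, and awk's floating-point rounding may differ from PostgreSQL's numeric ROUND in the last digit for exact ties.

```shell
#!/usr/bin/env bash

# implied_return_pct NEXT_DAY_PRED ONE_MONTH_PRED
# Mirrors the SQL expression ((one_month / next_day) - 1) * 100, to 2 decimals.
# Prints NULL when the denominator is 0, as NULLIF does in the query.
implied_return_pct() {
  awk -v nd="$1" -v om="$2" 'BEGIN {
    if (nd == 0) { print "NULL"; exit }
    printf "%.2f\n", ((om / nd) - 1) * 100
  }'
}
```

For example, implied_return_pct 200 210 prints 5.00, and a zero next-day prediction prints NULL, which the query sorts last.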
5.2 Accuracy Leaderboard
SELECT * FROM ldtm_accuracy_30d LIMIT 20;
5.3 Training Run History
SELECT ticker, DATE(run_at) AS run_date, best_val_loss, epochs_run, duration_sec
FROM ldtm_run_log
WHERE mode = 'train' AND status = 'success'
ORDER BY run_at DESC
LIMIT 20;
5.4 Ticker Prediction History (with actuals)
SELECT run_date, next_day_close_pred, next_day_actual,
next_day_pct_error, next_day_direction_correct
FROM ldtm_daily_snapshots
WHERE ticker = 'AAPL'
ORDER BY run_date DESC
LIMIT 30;
5.5 Failed Runs
SELECT ticker, mode, DATE(run_at) AS run_date, error_msg
FROM ldtm_run_log
WHERE status = 'failed'
ORDER BY run_at DESC
LIMIT 20;
5.6 Model Health Check
SELECT ticker, best_val_loss, epochs_run
FROM (
SELECT DISTINCT ON (ticker) ticker, best_val_loss, epochs_run
FROM ldtm_run_log
WHERE mode = 'train' AND status = 'success'
ORDER BY ticker, run_at DESC
) latest
ORDER BY best_val_loss ASC;
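To pick out the tickers that T3 describes, the output of this query can be filtered against the 1.0 level. This sketch assumes rows arrive as ticker|best_val_loss|epochs_run, which is the shape psql prints with the -A and -t flags. flag_high_val_loss is a hypothetical helper, and the default threshold is just the value mentioned in T3.

```shell
#!/usr/bin/env bash

# flag_high_val_loss [THRESHOLD]
# Reads "ticker|best_val_loss|epochs_run" rows on stdin and prints the
# tickers whose best_val_loss is above THRESHOLD (default 1.0).
flag_high_val_loss() {
  local threshold="${1:-1.0}"
  # "+ 0" forces a numeric comparison even if a field is empty.
  awk -F'|' -v t="$threshold" '$2 + 0 > t + 0 { print $1 " val_loss=" $2 }'
}
```

To use it, run the 5.6 query through docker exec trading-postgres psql -U postgres -d trading -At -c "..." and pipe the result into flag_high_val_loss.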
Part 6: Maintenance
6.1 Force Retrain a Single Ticker
# Remove old checkpoint first (optional — training overwrites it anyway)
docker run --rm -v trading_model_weights:/model_weights \
busybox rm -f /model_weights/ldtm/AAPL_ldtm.pt
# Retrain
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode train
6.2 Monthly Full Retrain (manual trigger)
bash schedule/run_ldtm_monthly_retrain.sh
# Note: this script enforces a date guard; to bypass it, call run_ldtm.sh directly:
bash model/ldtm/run_ldtm.sh --all --mode train --parallel 16
6.3 Check Checkpoint Sizes
docker run --rm -v trading_model_weights:/model_weights \
busybox ls -lh /model_weights/ldtm/ | head -20
# Each .pt file should be ~900KB (227K params × 4 bytes FP32)
6.4 Verify Snapshot Fill-Back Is Working
After 2+ days of inference runs:
docker exec trading-postgres psql -U postgres -d trading -c "
SELECT ticker, run_date, next_day_close_pred, next_day_actual,
next_day_direction_correct, next_day_pct_error
FROM ldtm_daily_snapshots
WHERE next_day_actual IS NOT NULL
ORDER BY run_date DESC, ticker
LIMIT 10;
"
6.5 Rebuild Model Image
After code changes to model/ldtm/*.py:
docker build -t model-ldtm ./model/ldtm
# All checkpoints remain valid unless config.py changes (input_size, hidden_size, etc.)
# If LDTMConfig changes: delete old checkpoints and retrain
6.6 Disk Cleanup
# Remove old Docker images
docker image prune -f
# Check volume size
docker system df
# Check model weights volume
docker run --rm -v trading_model_weights:/model_weights \
busybox du -sh /model_weights/ldtm
# Expected: ~103 × 0.9MB ≈ 93MB
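The size figures in 6.3 and 6.6 follow directly from the parameter count. Here is a quick check of that arithmetic, using only the numbers quoted above (227K parameters, FP32, 103 tickers). Saved .pt files also carry some serialization overhead, so the results are approximate.

```shell
#!/usr/bin/env bash

params=227000        # parameter count quoted in 6.3
bytes_per_param=4    # FP32
tickers=103

ckpt_bytes=$(( params * bytes_per_param ))   # size of one checkpoint
total_bytes=$(( ckpt_bytes * tickers ))      # size of the whole ldtm/ directory

# Integer division, decimal units (1 KB = 1000 bytes).
echo "per checkpoint: $(( ckpt_bytes / 1000 )) KB"
echo "all tickers:    $(( total_bytes / 1000000 )) MB"
```

A checkpoint that is far smaller than this, or a directory total well above it, is a reason to look at the volume more closely.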
Part 7: Troubleshooting
T1: "No trained model at /model_weights/ldtm/AAPL_ldtm.pt"
Cause: Checkpoint doesn't exist (first run, or volume was reset).
Fix:
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode train
T2: Model Predicts Wildly Wrong Prices
Cause: Insufficient data history in market_data_daily for this ticker.
Diagnosis:
docker exec trading-postgres psql -U postgres -d trading -c "
SELECT ticker, COUNT(*) AS rows, MIN(date), MAX(date)
FROM market_data_daily
WHERE ticker IN ('NFLX','BKNG','STX')
GROUP BY ticker ORDER BY rows ASC;
"
Fix: Trigger a full history backfill for the affected ticker:
docker compose --profile ingest run --rm -e TICKER=NFLX ingestion-ticker
T3: val_loss stuck above 1.0
Cause: Ticker has high inherent volatility (MSTR, TSLA, DDOG) or very short history.
Action: This is expected for volatile tickers. Consider:
- Increasing to --epochs 200 for more training iterations
- The direction signal is still meaningful even if absolute error is high
T4: Cron Jobs Not Running
Diagnosis:
crontab -l
grep ldtm /var/log/syslog | tail -20
tail -20 schedule/logs/ldtm_infer.log
Fix:
bash schedule/install_cron.sh
T5: Dashboard Shows "No snapshot data"
Cause: ldtm_daily_snapshots is empty or snapshot_writer.py hasn't run yet.
Fix:
# Backfill from existing ldtm_run_log data
docker run --rm --network host --env-file .env \
-v "$(pwd)/model/ldtm:/app" -w /app python:3.11-slim \
sh -c "pip install psycopg2-binary sqlalchemy -q && python snapshot_writer.py --backfill"
T6: LLM Query Returns "Connection refused"
Cause: Triton/Mistral-7B server is not running (it's a separate process, not managed by this project).
Diagnosis:
curl -s http://localhost:8000/v1/models
# Should return {"object":"list","data":[{"id":"engine-fp8",...}]}
Workaround: Point LLM_BASE_URL at an Anthropic or OpenAI endpoint and set the matching API key and model:
# In .env:
LLM_BASE_URL=https://api.anthropic.com/v1
LLM_API_KEY=sk-ant-...
LLM_MODEL=claude-sonnet-4-6
T7: PostgreSQL Connection Refused
# Check container is running
docker ps | grep trading-postgres
# If not running:
docker compose up -d postgres
# If running but not accepting connections:
docker exec trading-postgres pg_isready -U postgres
docker logs trading-postgres --tail 20
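After docker compose up -d postgres the container can take a few seconds to accept connections, so an immediate pg_isready may fail even though nothing is wrong. Below is a small retry sketch; wait_until is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary choices.

```shell
#!/usr/bin/env bash

# wait_until TRIES DELAY CMD...
# Reruns CMD until it succeeds or TRIES attempts have been made.
# Sleeps DELAY seconds between attempts. Returns 0 on success, 1 on timeout.
wait_until() {
  local tries="$1" delay="$2" attempt
  shift 2
  for (( attempt = 1; attempt <= tries; attempt++ )); do
    if "$@"; then
      return 0
    fi
    # No need to sleep after the final failed attempt.
    if (( attempt < tries )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (30 attempts, 2 s apart):
#   wait_until 30 2 docker exec trading-postgres pg_isready -U postgres -d trading
```

If the helper times out, the container logs from the step above are the next place to look.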
Part 8: Azure Deployment (When Ready)
8.1 Prerequisites
# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login
8.2 One-Time Setup
# Create resource group + storage
az group create --name trading-dashboard --location eastus2
az storage account create --name dgxtradingdata \
--resource-group trading-dashboard \
--sku Standard_LRS \
--allow-blob-public-access true
az storage container create --name trading-snapshots \
--account-name dgxtradingdata \
--public-access blob
# Get and save connection string
az storage account show-connection-string \
--name dgxtradingdata \
--resource-group trading-dashboard \
--query connectionString -o tsv
# → Add to .env as AZURE_BLOB_CONN_STR=...
# Create Container Registry
az acr create --name dgxtradingdash \
--resource-group trading-dashboard \
--sku Basic
# Create Container App environment
az containerapp env create \
--name trading-env \
--resource-group trading-dashboard \
--location eastus2
8.3 Deploy Dashboard
# Login to registry
az acr login --name dgxtradingdash
# Build + push
docker build -t dgxtradingdash.azurecr.io/dashboard:latest ./dashboard
docker push dgxtradingdash.azurecr.io/dashboard:latest
# Deploy
az containerapp create \
--name trading-dashboard \
--resource-group trading-dashboard \
--environment trading-env \
--image dgxtradingdash.azurecr.io/dashboard:latest \
--target-port 8501 \
--ingress external \
--min-replicas 0 \
--max-replicas 1 \
--set-env-vars \
DATA_SOURCE=blob \
"AZURE_BLOB_URL=https://dgxtradingdata.blob.core.windows.net/trading-snapshots"
8.4 Update Dashboard Image
docker build -t dgxtradingdash.azurecr.io/dashboard:latest ./dashboard
az acr login --name dgxtradingdash
docker push dgxtradingdash.azurecr.io/dashboard:latest
az containerapp update \
--name trading-dashboard \
--resource-group trading-dashboard \
--image dgxtradingdash.azurecr.io/dashboard:latest
Appendix A: Environment Variables Reference
| Variable | Required | Default | Description |
|---|---|---|---|
| DB_HOST | Yes | — | PostgreSQL host |
| DB_PORT | No | 5432 | PostgreSQL port |
| DB_NAME | Yes | — | Database name |
| DB_USER | Yes | — | DB username |
| DB_PASSWORD | Yes | — | DB password |
| TWS_HOST | Yes (ingest) | — | IB Gateway host |
| TWS_PORT | Yes (ingest) | — | IB Gateway port |
| LLM_BASE_URL | No | http://localhost:8000/v1 | LLM API endpoint |
| LLM_API_KEY | No | none | API key (Triton doesn't need one) |
| LLM_MODEL | No | engine-fp8 | Model name |
| DATA_SOURCE | No | db | Dashboard mode: db or blob |
| AZURE_BLOB_URL | No (Azure only) | — | Blob storage base URL |
| AZURE_BLOB_CONN_STR | No (Azure only) | — | Storage connection string |
| AZURE_BLOB_CONTAINER | No | trading-snapshots | Blob container name |
Appendix B: File Structure
dgx-trading-system/
├── model/ldtm/
│ ├── config.py LDTMConfig dataclass
│ ├── model.py LDTMModel (LSTM + 3 heads)
│ ├── dataset.py OHLCVDataset + build_inference_window
│ ├── trainer.py Training loop CLI
│ ├── predict.py Inference CLI
│ ├── evaluator.py Evaluation metrics
│ ├── export.py ONNX export
│ ├── db_log.py Silent DB logger
│ ├── orchestrate.py GPU-aware parallel dispatcher
│ ├── run_ldtm.sh Main orchestration entry point
│ ├── schema.sql ldtm_run_log DDL
│ ├── snapshots_schema.sql ldtm_daily_snapshots DDL
│ ├── snapshot_writer.py Inference → snapshot upsert
│ ├── snapshot_fillback.py Fill actuals into snapshots
│ └── Dockerfile NGC PyTorch base
│
├── llm/
│ ├── llm_query.py LLM context + query CLI
│ ├── Dockerfile python:3.11-slim
│ └── requirements.txt
│
├── dashboard/
│ ├── app.py Streamlit dashboard
│ ├── export_to_blob.py Azure Blob exporter
│ ├── Dockerfile python:3.11-slim
│ └── requirements.txt
│
├── schedule/
│ ├── install_cron.sh Install all cron entries
│ ├── run_ldtm_infer.sh Daily inference pipeline
│ ├── run_ldtm_canary_retrain.sh Weekly 3-ticker retrain
│ ├── run_ldtm_monthly_retrain.sh Full monthly retrain
│ └── run_blob_export.sh Azure Blob nightly export
│
├── docker-compose.yml
└── .env