LDTM v2 — Operations Runbook

Model: Long-Duration Temporal Model (LDTM) v2
Date: 2026-04-21
System: NVIDIA GB10 DGX, Ubuntu, Docker, PostgreSQL 15

Quick Reference

Task	Command
Run inference — single ticker	`bash model/ldtm/run_ldtm.sh --ticker AAPL --mode infer`
Run inference — all tickers	`bash model/ldtm/run_ldtm.sh --all --mode infer --parallel 16`
Full retrain — all tickers	`bash model/ldtm/run_ldtm.sh --all --mode train --parallel 16`
Initialize DB tables	`bash model/ldtm/run_ldtm.sh --init-db`
Apply snapshot schema	See §1.3
Start dashboard	`docker compose --profile dashboard up -d dashboard`
LLM query	`docker compose --profile llm run --rm llm-query --group mega_cap --question "..."`
Install cron jobs	`bash schedule/install_cron.sh`

Prerequisites

1. System Requirements

Component	Requirement
OS	Ubuntu 22.04+
Docker	24.x+ with NVIDIA Container Toolkit
GPU	NVIDIA GB10 (or any CUDA 12.x GPU)
RAM	32 GB+ host RAM
Disk	50 GB+ for model weights + Docker images
PostgreSQL	Running as `trading-postgres` container
IB Gateway	Running at `TWS_HOST:TWS_PORT` (for data ingestion)

2. Verify Prerequisites

# Docker running
docker info | grep "Server Version"

# GPU accessible
nvidia-smi

# PostgreSQL healthy
docker exec trading-postgres pg_isready -U postgres -d trading

# .env file present
ls -la .env

# Required env vars (prints key names only, so DB_PASSWORD is not echoed to the terminal)
grep -E "^(DB_HOST|DB_NAME|DB_USER|DB_PASSWORD)=" .env | cut -d= -f1
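
A key can be present in .env and still be empty. The sketch below reports required keys that are missing or have no value, without printing any values. The helper name missing_env_keys is hypothetical and not part of the project.

```shell
#!/usr/bin/env bash
# Print each required key that is absent or empty in the given env file.
# Prints nothing when every key has a value. Values are never echoed.
missing_env_keys() {
    local env_file="$1"; shift
    local key
    for key in "$@"; do
        # Match "KEY=<at least one character>" anchored at the start of a line.
        grep -Eq "^${key}=.+" "$env_file" || echo "$key"
    done
}

# Usage:
#   missing_env_keys .env DB_HOST DB_NAME DB_USER DB_PASSWORD
```

An empty result means all listed keys are set; any output is the list of keys to fix.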

3. Required `.env` Keys

DB_HOST=localhost
DB_PORT=5432
DB_NAME=trading
DB_USER=postgres
DB_PASSWORD=<your_password>
TWS_HOST=<ib_gateway_host>
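# TWS defaults: 7497=paper, 7496=live. IB Gateway defaults: 4002=paper, 4001=live.
# Keep comments on their own line: docker run --env-file treats an inline "# ..." as part of the value.
TWS_PORT=7497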

Part 1: First-Time Setup

1.1 Build the LDTM Docker Image

cd /home/aimikamirai/projects/dgx-trading-system
docker build -t model-ldtm ./model/ldtm

Expected: ~5-10 minutes (downloads NGC PyTorch base image ~22GB on first run).

Verify:

docker images model-ldtm
# model-ldtm   latest   <sha>   <date>   22.1GB

1.2 Build Supporting Images

docker build -t trading-dashboard ./dashboard
docker build -t ldtm-llm-query ./llm

Each image builds in about 2 minutes on a python:3.11-slim base.

1.3 Initialize Database Tables

# Create ldtm_run_log table
bash model/ldtm/run_ldtm.sh --init-db

# Create ldtm_daily_snapshots table + accuracy view
docker exec -i trading-postgres psql -U postgres -d trading \
    < model/ldtm/snapshots_schema.sql

Verify:

docker exec trading-postgres psql -U postgres -d trading \
    -c "\dt ldtm_*"
# Should list: ldtm_run_log, ldtm_daily_snapshots

docker exec trading-postgres psql -U postgres -d trading \
    -c "\dv ldtm_*"
# Should list: ldtm_accuracy_30d

1.4 Install Cron Jobs

bash schedule/install_cron.sh

Verify:

crontab -l | grep ldtm
# Should show 4 entries: infer, blob_export, canary_retrain, monthly_retrain
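
The grep above confirms that some entries exist but not which job is missing. The sketch below names the missing ones. It assumes each cron line invokes one of the schedule/ scripts listed in Appendix B by filename; the runbook does not show the installed crontab lines, so treat that as an assumption. The helper name missing_cron_scripts is hypothetical.

```shell
#!/usr/bin/env bash
# Read crontab text on stdin and print each expected schedule script
# that no active (non-commented) line references.
missing_cron_scripts() {
    local crontab_text script
    crontab_text=$(cat)
    for script in run_ldtm_infer.sh run_blob_export.sh \
                  run_ldtm_canary_retrain.sh run_ldtm_monthly_retrain.sh; do
        # Drop commented-out lines first, then look for the script name.
        printf '%s\n' "$crontab_text" \
            | grep -v '^[[:space:]]*#' \
            | grep -Fq "$script" || echo "$script"
    done
}

# Usage:
#   crontab -l | missing_cron_scripts
```

No output means all four jobs are referenced by an active cron line.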

Part 2: Running the Model

2.1 Single Ticker — Train + Infer

# Train only
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode train

# Infer only (requires existing checkpoint)
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode infer

# Both (train then infer)
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode both

Expected output:

[run_ldtm] mode=both  tickers=1  parallel=1  epochs=100
[run_ldtm] ── Ticker 1/1: AAPL ──
[LDTM] Training AAPL  window=30  hidden=128  layers=2
[LDTM] Device: cuda  AMP: True
[LDTM] Dataset loaded in 0.8s  train=3358  val=720
[LDTM] epoch   1/100  train_loss=0.542318  val_loss=0.621045  lr=1.00e-03
...
[LDTM] Early stop — best val_loss=0.577952 at epoch 10
[LDTM] Checkpoint → /model_weights/ldtm/AAPL_ldtm.pt
{"ticker": "AAPL", "next_day_close": 268.55, ...}

2.2 Named Group

bash model/ldtm/run_ldtm.sh --group mega_cap --mode both --parallel 4

Groups available: mega_cap semis software internet consumer healthcare industrial telecom financials hardware intl etfs

2.3 All 103 Tickers

# Full train + infer (first run or monthly retrain)
bash model/ldtm/run_ldtm.sh --all --mode both --parallel 16

# Daily inference only (after checkpoints exist)
bash model/ldtm/run_ldtm.sh --all --mode infer --parallel 16

Expected wall-clock timing on GB10:

Training all 103 (4 slots, Triton competing): ~20 minutes
Inference all 103 (4 slots, Triton competing): ~3 minutes
Training all 103 (16 slots, no Triton): ~8 minutes

2.4 Snapshot Pipeline (after inference)

# Write today's predictions to ldtm_daily_snapshots
docker run --rm --network host --env-file .env \
    -v "$(pwd)/model/ldtm:/app" -w /app \
    python:3.11-slim \
    sh -c "pip install psycopg2-binary sqlalchemy -q && python snapshot_writer.py"

# Fill actuals for past predictions
docker run --rm --network host --env-file .env \
    -v "$(pwd)/model/ldtm:/app" -w /app \
    python:3.11-slim \
    sh -c "pip install psycopg2-binary sqlalchemy pandas -q && python snapshot_fillback.py"

Or, with Docker Compose (run docker compose build first):

docker compose --profile ldtm-tools run --rm ldtm-snapshot-writer
docker compose --profile ldtm-tools run --rm ldtm-snapshot-fillback

2.5 Backfill Historical Snapshots

If running for the first time with existing ldtm_run_log data:

docker run --rm --network host --env-file .env \
    -v "$(pwd)/model/ldtm:/app" -w /app \
    python:3.11-slim \
    sh -c "pip install psycopg2-binary sqlalchemy -q && python snapshot_writer.py --backfill"

Part 3: Dashboard

3.1 Start Dashboard (Local)

docker compose --profile dashboard up -d dashboard

Access at: http://localhost:8501

3.2 Stop Dashboard

docker compose --profile dashboard down

3.3 Check Dashboard Logs

docker logs trading-dashboard --tail 50

3.4 Rebuild Dashboard (after code changes)

docker compose --profile dashboard build dashboard
docker compose --profile dashboard up -d dashboard

Part 4: LLM Query Interface

4.1 Basic Query

docker compose --profile llm run --rm llm-query \
    --question "What is the overall market signal today?"

4.2 Ticker-Specific Query

docker compose --profile llm run --rm llm-query \
    --ticker NVDA \
    --question "What does the model predict for NVDA this week and what is the recent accuracy?"

4.3 Group Query

docker compose --profile llm run --rm llm-query \
    --group semis \
    --question "Which semiconductor stocks have the strongest momentum?"

4.4 Custom Ticker List

docker compose --profile llm run --rm llm-query \
    --tickers "NVDA AMD AMAT MU MRVL" \
    --question "Rank these by 1-month implied upside"

4.5 All Tickers Market Summary

docker compose --profile llm run --rm llm-query \
    --all \
    --question "Summarize the market in 5 bullet points based on today's predictions"

Note: --all passes all 103 tickers as context. This uses ~15K tokens, which is close to the Mistral-7B context limit. Prefer --group for faster, cleaner responses.

Part 5: Querying the Database

5.1 Today's Predictions (all tickers, ranked by 1-month return)

SELECT s.ticker, s.run_date, s.run_date_close,
       s.next_day_close_pred,
       s.one_month_close_pred,
       ROUND((((s.one_month_close_pred / NULLIF(s.next_day_close_pred,0)) - 1) * 100)::numeric, 2)
           AS implied_1m_return_pct
FROM ldtm_daily_snapshots s
WHERE s.run_date = (SELECT MAX(run_date) FROM ldtm_daily_snapshots)
ORDER BY implied_1m_return_pct DESC NULLS LAST;
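
The implied_1m_return_pct expression is the ratio of the one-month prediction to the next-day prediction, minus one, as a percentage. A small sketch that mirrors it for spot-checking a row by hand; the helper name and the input prices are hypothetical, and awk's rounding may differ from PostgreSQL's numeric ROUND on exact half-way values.

```shell
#!/usr/bin/env bash
# Mirror of the SQL expression:
#   ROUND(((one_month_pred / NULLIF(next_day_pred, 0)) - 1) * 100, 2)
# Prints NULL when the denominator is zero, as NULLIF yields in SQL.
implied_1m_return_pct() {
    awk -v m="$1" -v d="$2" 'BEGIN {
        if (d + 0 == 0) { print "NULL"; exit }
        printf "%.2f\n", (m / d - 1) * 100
    }'
}

implied_1m_return_pct 110 100   # prints 10.00
implied_1m_return_pct 95 100    # prints -5.00
implied_1m_return_pct 110 0     # prints NULL
```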

5.2 Accuracy Leaderboard

SELECT * FROM ldtm_accuracy_30d LIMIT 20;

5.3 Training Run History

SELECT ticker, DATE(run_at) AS run_date, best_val_loss, epochs_run, duration_sec
FROM ldtm_run_log
WHERE mode = 'train' AND status = 'success'
ORDER BY run_at DESC
LIMIT 20;

5.4 Ticker Prediction History (with actuals)

SELECT run_date, next_day_close_pred, next_day_actual,
       next_day_pct_error, next_day_direction_correct
FROM ldtm_daily_snapshots
WHERE ticker = 'AAPL'
ORDER BY run_date DESC
LIMIT 30;

5.5 Failed Runs

SELECT ticker, mode, DATE(run_at) AS run_date, error_msg
FROM ldtm_run_log
WHERE status = 'failed'
ORDER BY run_at DESC
LIMIT 20;

5.6 Model Health Check

SELECT ticker, best_val_loss, epochs_run
FROM (
    SELECT DISTINCT ON (ticker) ticker, best_val_loss, epochs_run
    FROM ldtm_run_log
    WHERE mode = 'train' AND status = 'success'
    ORDER BY ticker, run_at DESC
) latest
ORDER BY best_val_loss ASC;

Part 6: Maintenance

6.1 Force Retrain a Single Ticker

# Remove old checkpoint first (optional — training overwrites it anyway)
docker run --rm -v trading_model_weights:/model_weights \
    busybox rm -f /model_weights/ldtm/AAPL_ldtm.pt

# Retrain
bash model/ldtm/run_ldtm.sh --ticker AAPL --mode train

6.2 Monthly Full Retrain (manual trigger)

bash schedule/run_ldtm_monthly_retrain.sh
# Note: the script checks a date guard and may exit without retraining.
# To bypass the guard, call run_ldtm.sh directly:
bash model/ldtm/run_ldtm.sh --all --mode train --parallel 16

6.3 Check Checkpoint Sizes

docker run --rm -v trading_model_weights:/model_weights \
    busybox ls -lh /model_weights/ldtm/ | head -20
# Each .pt file should be ~900KB (227K params × 4 bytes FP32)
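
The size estimate above is plain arithmetic: 227,000 parameters × 4 bytes = 908,000 bytes, roughly 887 KiB. The sketch below turns that into a pass/fail check. The helper names and the 25% tolerance are assumptions; a checkpoint that also stores optimizer state or metadata would be larger.

```shell
#!/usr/bin/env bash
# Expected weight payload in bytes: parameter count x bytes per parameter (default 4, FP32).
expected_ckpt_bytes() {
    echo $(( $1 * ${2:-4} ))
}

# Succeed when actual_bytes is within pct percent (default 25) of expected_bytes.
ckpt_size_ok() {
    local actual="$1" expected="$2" pct="${3:-25}"
    local lo=$(( expected * (100 - pct) / 100 ))
    local hi=$(( expected * (100 + pct) / 100 ))
    [ "$actual" -ge "$lo" ] && [ "$actual" -le "$hi" ]
}

# Usage (reads one checkpoint's size from the weights volume):
#   size=$(docker run --rm -v trading_model_weights:/model_weights \
#       busybox sh -c 'wc -c < /model_weights/ldtm/AAPL_ldtm.pt')
#   ckpt_size_ok "$size" "$(expected_ckpt_bytes 227000)" || echo "AAPL checkpoint size looks off"
```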

6.4 Verify Snapshot Fill-Back Is Working

After 2+ days of inference runs:

docker exec trading-postgres psql -U postgres -d trading -c "
SELECT ticker, run_date, next_day_close_pred, next_day_actual,
       next_day_direction_correct, next_day_pct_error
FROM ldtm_daily_snapshots
WHERE next_day_actual IS NOT NULL
ORDER BY run_date DESC, ticker
LIMIT 10;
"

6.5 Rebuild Model Image

After code changes to model/ldtm/*.py:

docker build -t model-ldtm ./model/ldtm
# Existing checkpoints stay valid unless LDTMConfig in config.py changes (input_size, hidden_size, etc.).
# If LDTMConfig changes, delete the old checkpoints and retrain.

6.6 Disk Cleanup

# Remove old Docker images
docker image prune -f

# Check volume size
docker system df

# Check model weights volume
docker run --rm -v trading_model_weights:/model_weights \
    busybox du -sh /model_weights/ldtm
# Expected: ~103 × 0.9MB ≈ 93MB

Part 7: Troubleshooting

T1: "No trained model at /model_weights/ldtm/AAPL_ldtm.pt"

Cause: Checkpoint doesn't exist (first run, or volume was reset).

Fix:

bash model/ldtm/run_ldtm.sh --ticker AAPL --mode train

T2: Model Predicts Wildly Wrong Prices

Cause: Insufficient data history in market_data_daily for this ticker.

Diagnosis:

docker exec trading-postgres psql -U postgres -d trading -c "
SELECT ticker, COUNT(*) AS rows, MIN(date), MAX(date)
FROM market_data_daily
WHERE ticker IN ('NFLX','BKNG','STX')
GROUP BY ticker ORDER BY rows ASC;
"

Fix: Trigger a full history backfill for the affected ticker:

docker compose --profile ingest run --rm -e TICKER=NFLX ingestion-ticker

T3: `val_loss` stuck above 1.0

Cause: The ticker is inherently volatile (e.g. MSTR, TSLA, DDOG) or has very short history.

Action: This is expected for volatile tickers. Consider training longer with --epochs 200.

Note: The direction signal can still be meaningful even when the absolute error is high.

T4: Cron Jobs Not Running

Diagnosis:

crontab -l
grep ldtm /var/log/syslog | tail -20
tail -n 20 schedule/logs/ldtm_infer.log

Fix:

bash schedule/install_cron.sh

T5: Dashboard Shows "No snapshot data"

Cause: ldtm_daily_snapshots is empty or snapshot_writer.py hasn't run yet.

Fix:

# Backfill from existing ldtm_run_log data
docker run --rm --network host --env-file .env \
    -v "$(pwd)/model/ldtm:/app" -w /app python:3.11-slim \
    sh -c "pip install psycopg2-binary sqlalchemy -q && python snapshot_writer.py --backfill"

T6: LLM Query Returns "Connection refused"

Cause: Triton/Mistral-7B server is not running (it's a separate process, not managed by this project).

Diagnosis:

curl -s http://localhost:8000/v1/models
# Should return {"object":"list","data":[{"id":"engine-fp8",...}]}
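
To script this check, the response can be tested for the expected model id. This is a minimal sketch using a plain-text match to avoid extra dependencies; has_model_id is a hypothetical helper, and a JSON-aware tool would be stricter.

```shell
#!/usr/bin/env bash
# Succeed when the JSON on stdin contains an entry of the form "id": "<model_id>".
has_model_id() {
    grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage (endpoint and model name as given in this runbook):
#   curl -s http://localhost:8000/v1/models | has_model_id engine-fp8 \
#       && echo "model loaded" || echo "model not listed or server down"
```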

Workaround: Point LLM_BASE_URL at an Anthropic or OpenAI endpoint:

# In .env:
LLM_BASE_URL=https://api.anthropic.com/v1
LLM_API_KEY=sk-ant-...
LLM_MODEL=claude-sonnet-4-6

T7: PostgreSQL Connection Refused

# Check container is running
docker ps | grep trading-postgres

# If not running:
docker compose up -d postgres

# If running but not accepting connections:
docker exec trading-postgres pg_isready -U postgres
docker logs trading-postgres --tail 20
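
After starting the container, PostgreSQL can take a few seconds to accept connections. A generic retry wrapper, sketched below, avoids racing it. The helper name retry and the attempt and delay values are assumptions, not part of the project scripts.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds, at most max_attempts times,
# sleeping delay seconds between attempts.
retry() {
    local max_attempts="$1" delay="$2"; shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: gave up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Usage:
#   docker compose up -d postgres
#   retry 10 2 docker exec trading-postgres pg_isready -U postgres -d trading
```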

Part 8: Azure Deployment (When Ready)

8.1 Prerequisites

# Install Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login

8.2 One-Time Setup

# Create resource group + storage
az group create --name trading-dashboard --location eastus2
az storage account create --name dgxtradingdata \
    --resource-group trading-dashboard \
    --sku Standard_LRS \
    --allow-blob-public-access true
az storage container create --name trading-snapshots \
    --account-name dgxtradingdata \
    --public-access blob

# Get and save connection string
az storage account show-connection-string \
    --name dgxtradingdata \
    --resource-group trading-dashboard \
    --query connectionString -o tsv
# → Add to .env as AZURE_BLOB_CONN_STR=...

# Create Container Registry
az acr create --name dgxtradingdash \
    --resource-group trading-dashboard \
    --sku Basic

# Create Container App environment
az containerapp env create \
    --name trading-env \
    --resource-group trading-dashboard \
    --location eastus2

8.3 Deploy Dashboard

# Login to registry
az acr login --name dgxtradingdash

# Build + push
docker build -t dgxtradingdash.azurecr.io/dashboard:latest ./dashboard
docker push dgxtradingdash.azurecr.io/dashboard:latest

# Deploy
az containerapp create \
    --name trading-dashboard \
    --resource-group trading-dashboard \
    --environment trading-env \
    --image dgxtradingdash.azurecr.io/dashboard:latest \
    --target-port 8501 \
    --ingress external \
    --min-replicas 0 \
    --max-replicas 1 \
    --set-env-vars \
        DATA_SOURCE=blob \
        "AZURE_BLOB_URL=https://dgxtradingdata.blob.core.windows.net/trading-snapshots"

8.4 Update Dashboard Image

docker build -t dgxtradingdash.azurecr.io/dashboard:latest ./dashboard
az acr login --name dgxtradingdash
docker push dgxtradingdash.azurecr.io/dashboard:latest
az containerapp update \
    --name trading-dashboard \
    --resource-group trading-dashboard \
    --image dgxtradingdash.azurecr.io/dashboard:latest

Appendix A: Environment Variables Reference

Variable	Required	Default	Description
DB_HOST	Yes	—	PostgreSQL host
DB_PORT	No	5432	PostgreSQL port
DB_NAME	Yes	—	Database name
DB_USER	Yes	—	DB username
DB_PASSWORD	Yes	—	DB password
TWS_HOST	Yes (ingest)	—	IB Gateway host
TWS_PORT	Yes (ingest)	—	IB Gateway port
LLM_BASE_URL	No	http://localhost:8000/v1	LLM API endpoint
LLM_API_KEY	No	none	API key (Triton doesn't need one)
LLM_MODEL	No	engine-fp8	Model name
DATA_SOURCE	No	db	Dashboard mode: db or blob
AZURE_BLOB_URL	No (Azure only)	—	Blob storage base URL
AZURE_BLOB_CONN_STR	No (Azure only)	—	Storage connection string
AZURE_BLOB_CONTAINER	No	trading-snapshots	Blob container name

Appendix B: File Structure

dgx-trading-system/
├── model/ldtm/
│   ├── config.py              LDTMConfig dataclass
│   ├── model.py               LDTMModel (LSTM + 3 heads)
│   ├── dataset.py             OHLCVDataset + build_inference_window
│   ├── trainer.py             Training loop CLI
│   ├── predict.py             Inference CLI
│   ├── evaluator.py           Evaluation metrics
│   ├── export.py              ONNX export
│   ├── db_log.py              Silent DB logger
│   ├── orchestrate.py         GPU-aware parallel dispatcher
│   ├── run_ldtm.sh            Main orchestration entry point
│   ├── schema.sql             ldtm_run_log DDL
│   ├── snapshots_schema.sql   ldtm_daily_snapshots DDL
│   ├── snapshot_writer.py     Inference → snapshot upsert
│   ├── snapshot_fillback.py   Fill actuals into snapshots
│   └── Dockerfile             NGC PyTorch base
│
├── llm/
│   ├── llm_query.py           LLM context + query CLI
│   ├── Dockerfile             python:3.11-slim
│   └── requirements.txt
│
├── dashboard/
│   ├── app.py                 Streamlit dashboard
│   ├── export_to_blob.py      Azure Blob exporter
│   ├── Dockerfile             python:3.11-slim
│   └── requirements.txt
│
├── schedule/
│   ├── install_cron.sh        Install all cron entries
│   ├── run_ldtm_infer.sh      Daily inference pipeline
│   ├── run_ldtm_canary_retrain.sh  Weekly 3-ticker retrain
│   ├── run_ldtm_monthly_retrain.sh Full monthly retrain
│   └── run_blob_export.sh     Azure Blob nightly export
│
├── docker-compose.yml
└── .env
