Project: AutoTransformer — Bounded Self-Improving Architecture Search

One-Line Version

Built a self-improving transformer that trains a single global model across 20+ stocks simultaneously, proposes and applies its own architecture changes within explicit bounds, and documents, before any results come in, what it cannot yet be trusted to conclude.

The Situation

LDTM trains one LSTM per ticker — 103 separate models, each learning only from its own history. That design is right for 30-day sequence windows where per-ticker volatility regimes matter. But it leaves a different question unanswered: can a single model trained across the full cross-section of stocks — all tickers simultaneously, across a full trading year of context — learn something no single-ticker model can?

The harder design challenge was the search loop. The history of ML research is full of systems that search their own hyperparameters until they find something that looks good on whatever data was available — then declare success. The goal here was different: build a system that searches within explicit bounds, stops when improvement flatlines rather than when budget runs out, logs every change it makes and the exact metric that triggered it, and documents what the current version cannot yet be trusted to conclude before results exist. The discipline of the search matters more than whether the current result has edge.

What I Built — Technical

Architecture: One Model, All Tickers

The key design choice that separates this from LDTM: one transformer trained on all tickers simultaneously, not one model per ticker. Each training sample is a 252-day window — one full trading year — for one ticker. The model receives a learned ticker embedding added to the input projection at every timestep, giving the shared weights per-ticker context without requiring separate parameters per stock.

Input: (batch, 252, feature_dim)
  → input_projection: Linear(feature_dim → d_model=256)
  + position_embedding (learned, not sinusoidal)
  + ticker_embedding(ticker_id) broadcast across time
  → 6 × TransformerBlock
      MultiheadAttention (8 heads, d_model=256, batch_first=True)
      LayerNorm + residual
      Feedforward: Linear(256→1024) → GELU → Dropout → Linear(1024→256)
      LayerNorm + residual
  → final LayerNorm → pool last token
  → prediction head: Linear(256→256) → GELU → Dropout → Linear(256→9)
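
The diagram above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's actual module: the class names, the pre-norm placement of LayerNorm, and the absence of an attention mask are assumptions made here, since the diagram does not pin them down.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """One encoder block: self-attention and feedforward, each with a residual.

    Pre-norm ordering is an assumption; the diagram only says
    "LayerNorm + residual".
    """

    def __init__(self, d_model=256, n_heads=8, ff_dim=1024, dropout=0.10):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.ff(self.norm2(x))


class GlobalTickerTransformer(nn.Module):
    """Hypothetical name for the single model shared by all tickers."""

    def __init__(self, feature_dim, num_tickers, seq_len=252, d_model=256,
                 n_heads=8, num_layers=6, ff_dim=1024, dropout=0.10,
                 num_outputs=9):
        super().__init__()
        self.input_projection = nn.Linear(feature_dim, d_model)
        self.position_embedding = nn.Embedding(seq_len, d_model)  # learned
        self.ticker_embedding = nn.Embedding(num_tickers, d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, ff_dim, dropout)
             for _ in range(num_layers)]
        )
        self.final_norm = nn.LayerNorm(d_model)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, num_outputs),
        )

    def forward(self, x, ticker_id):
        # x: (batch, seq_len, feature_dim); ticker_id: (batch,) integer ids
        positions = torch.arange(x.size(1), device=x.device)
        h = (self.input_projection(x)
             + self.position_embedding(positions)            # (seq, d_model)
             + self.ticker_embedding(ticker_id).unsqueeze(1))  # broadcast over time
        for block in self.blocks:
            h = block(h)
        return self.head(self.final_norm(h)[:, -1])  # pool the last token
```

Adding the ticker embedding at every timestep, rather than prepending it as an extra token, keeps the sequence length at 252 and lets the last-token pooling stay unchanged.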

Nine prediction targets: for each of three horizons (1-day, 5-day, 20-day), predict the log return of the future close, future rolling high, and future rolling low. Log returns are used because they are much closer to stationary than raw prices, comparable across tickers, and better behaved under optimization. At inference time, predicted log returns convert back to price ranges using the current close.
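
The inference-time conversion can be illustrated with a short sketch. It assumes each target is the log of a future value divided by the current close, and that the nine outputs are laid out as [close, high, low] for each horizon in turn; the function name and the output dictionary shape are hypothetical.

```python
import math

HORIZONS = (1, 5, 20)  # trading days


def to_price_ranges(current_close, predicted_log_returns):
    """Map nine predicted log returns to price levels, keyed by horizon.

    Assumed layout: [close, high, low] repeated once per horizon.
    """
    if len(predicted_log_returns) != 3 * len(HORIZONS):
        raise ValueError("expected three values per horizon")
    ranges = {}
    for i, horizon in enumerate(HORIZONS):
        close_lr, high_lr, low_lr = predicted_log_returns[3 * i:3 * i + 3]
        ranges[horizon] = {
            # price = current_close * exp(log_return)
            "close": current_close * math.exp(close_lr),
            "high": current_close * math.exp(high_lr),
            "low": current_close * math.exp(low_lr),
        }
    return ranges
```

Because the exponential is monotonic, any ordering that holds between predicted log returns (low below close below high) carries over unchanged to the reconstructed prices.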

Why a Transformer here and an LSTM in LDTM, and why both are right. At a 252-day sequence, the Transformer's self-attention earns its cost: the model needs to look across a full year to detect earnings-cycle patterns, seasonal regime shifts, and cross-asset momentum. At LDTM's 30-day sequence, the LSTM's O(n) recurrence is the better fit: a window that short gains little from attending over the full positional history, so the Transformer's O(n²) attention cost buys nothing meaningful. Both architecture choices are correct for their respective problems. Choosing a Transformer here and an LSTM there is not inconsistency. It is judgment applied to different sequence lengths, cross-section structures, and questions.

Feature Engineering — Cross-Ticker Comparable Representations

Built a 15-feature set designed for cross-ticker comparability. Every feature is expressed in relative terms (returns, ratios, logs, z-scores) so NVDA at $900 and CPRT at $34 produce comparable representations without separate scalers per stock:

Log returns at 1-day, 5-day, 20-day
Overnight gap (open vs previous close, normalized)
Intraday range (high minus low, normalized by close)
Intraday position (close relative to day's high-low range)
Close-to-high and close-to-low distances
Log volume, 20-day volatility, 20-day volume z-score
MA-20 and MA-60 ratio (price relative to trend)
ATR-20 ratio (average true range as fraction of price)
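
A small sketch shows why relative features make tickers comparable. It covers only a subset of the list above, and the exact denominators (previous close for the gap, close for the distances) are assumptions rather than the project's confirmed definitions.

```python
import math


def bar_features(open_, high, low, close, prev_close):
    """Relative features for one daily bar (illustrative subset)."""
    day_range = high - low
    return {
        "log_return_1d": math.log(close / prev_close),
        "overnight_gap": (open_ - prev_close) / prev_close,
        "intraday_range": day_range / close,
        # 0.0 = closed at the low, 1.0 = closed at the high; 0.5 for a flat bar
        "intraday_position": (close - low) / day_range if day_range > 0 else 0.5,
        "close_to_high": (high - close) / close,
        "close_to_low": (close - low) / close,
    }


# Two bars with the same percentage moves at very different price levels.
expensive = bar_features(open_=900.0, high=918.0, low=891.0, close=909.0,
                         prev_close=900.0)
cheap = bar_features(open_=34.0, high=34.68, low=33.66, close=34.34,
                     prev_close=34.0)
```

Both bars open flat, trade 2% up and 1% down, and close 1% higher, so `expensive` and `cheap` agree on every feature up to floating-point error despite the roughly 26x difference in price.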

Optional exogenous inputs — newsflow features from the Sentiment Engine (24h and 72h score_flow, article counts, source disagreement) and GDELT macro tone features — are joinable at the feature layer and controlled by config flags. Sentiment features are off by default (use_newsflow: False) until the OHLCV baseline is validated. This mirrors the Sentiment Engine's own four-phase sequencing: don't activate the additional signal layer until you know what the baseline alone is doing.

Range Consistency Penalty

Standard regression loss has no awareness of physical constraints. A model can predict a future low that exceeds the future high, or a close that sits outside the high-low band — geometrically impossible predictions that inflate apparent accuracy on the training objective while producing nonsense at inference.

Added a range consistency penalty that applies ReLU penalties for each geometric violation:

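import torch.nn.functional as F

def range_consistency_penalty(predictions, horizons):
    # predictions: (batch, 3 * len(horizons)); assumed layout per horizon: [close, high, low]
    penalties = predictions.new_zeros(())
    for i in range(len(horizons)):
        base = 3 * i  # column of this horizon's close prediction
        close_pred = predictions[:, base]
        high_pred  = predictions[:, base + 1]
        low_pred   = predictions[:, base + 2]

        penalties = penalties + F.relu(low_pred - close_pred).mean()   # low > close
        penalties = penalties + F.relu(close_pred - high_pred).mean()  # close > high
        penalties = penalties + F.relu(low_pred - high_pred).mean()    # low > high
    return penalties

# huber_loss is the regression term, computed separately on the same predictions
composite_loss = huber_loss + 0.10 * range_consistency_penalty(predictions, horizons)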

Huber loss is used for the regression objective because it is robust to the fat-tailed return distributions that MSE handles poorly. The 0.10 penalty weight is a tunable hyperparameter. The penalty is a soft constraint, not a hard guarantee: it tilts the loss surface against geometric violations rather than making them impossible, which is why range_violation_rate is still tracked as a metric.
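
The robustness argument for Huber loss can be made concrete with a worked example. The threshold `delta=0.02` below is a hypothetical value chosen for illustration; the source does not state which threshold the project uses.

```python
def huber(error, delta):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)


def half_squared(error):
    """The MSE-style term on the same scale as the quadratic part of Huber."""
    return 0.5 * error * error


typical = 0.01   # a 1% miss in log-return terms
outlier = 0.20   # a 20% miss, e.g. an earnings gap

typical_huber = huber(typical, delta=0.02)   # inside delta: same as half_squared
outlier_huber = huber(outlier, delta=0.02)   # outside delta: grows linearly
outlier_ratio = half_squared(outlier) / outlier_huber
```

With these numbers the outlier contributes about 5x less under Huber than under squared error, while the typical residual is treated identically. The threshold only matters relative to the target scale: if it sits well above every residual, Huber reduces to half the squared error.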

Bounded Autonomous Controller

The controller follows a strict priority-ordered decision tree — not a free-form agent:

If range_violation_rate > 10% and dropout < 0.20 → increase dropout by 0.05
If improvement below threshold and num_layers < 10 → add 2 layers
If improvement below threshold and d_model < 384 → widen by 128 (feedforward scales with d_model × 4)
If directional_accuracy < 52% and learning_rate > 3e-4 → halve learning rate
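
The four rules above can be sketched as a first-match-wins function. The dictionary keys and reason strings are hypothetical, and the reading that exactly one rule fires per iteration is an inference from the priority ordering rather than something copied from the controller's code.

```python
def propose_change(config, metrics, improvement, improvement_threshold=0.0025):
    """Return (new_config, reason) for the first rule that fires.

    Returns (copy_of_config, None) when no rule applies. The input config
    is never mutated, so the previous iteration's config stays loggable.
    """
    stalled = improvement < improvement_threshold
    new = dict(config)

    if metrics["range_violation_rate"] > 0.10 and config["dropout"] < 0.20:
        new["dropout"] = round(min(config["dropout"] + 0.05, 0.20), 2)
        return new, "range_violation_rate > 10%: dropout +0.05"

    if stalled and config["num_layers"] < 10:
        new["num_layers"] = min(config["num_layers"] + 2, 10)
        return new, "improvement below threshold: +2 layers"

    if stalled and config["d_model"] < 384:
        new["d_model"] = min(config["d_model"] + 128, 384)
        new["feedforward_dim"] = 4 * new["d_model"]  # feedforward tracks width
        return new, "improvement below threshold: d_model +128"

    if metrics["directional_accuracy"] < 0.52 and config["learning_rate"] > 3e-4:
        new["learning_rate"] = config["learning_rate"] / 2
        return new, "directional_accuracy < 52%: learning rate halved"

    return new, None
```

Returning the reason alongside the new config is what makes each change attributable to a single rule in the log.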

Stop criteria: if the last stabilization_rounds=2 consecutive iterations each produce improvement below improvement_threshold=0.0025, the loop terminates. Maximum iterations capped at 8 regardless. Hard-coded parameter bounds (layers ≤ 10, d_model ≤ 384) prevent the controller from building an unboundedly large model. Every decision is logged with the exact metric that triggered it.
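
A minimal sketch of the stop check, using the parameter names and defaults quoted above. It assumes `improvements` holds one value per completed iteration, measured as the gain over the previous iteration; how that gain is computed is not specified in the source.

```python
def should_stop(improvements, stabilization_rounds=2,
                improvement_threshold=0.0025, max_iterations=8):
    """Decide whether the search loop should terminate.

    improvements: per-iteration gains, oldest first, one per finished iteration.
    """
    if len(improvements) >= max_iterations:
        return True  # hard cap, regardless of progress
    if len(improvements) < stabilization_rounds:
        return False  # not enough history to call it a plateau
    recent = improvements[-stabilization_rounds:]
    return all(gain < improvement_threshold for gain in recent)
```

Requiring every one of the most recent rounds to fall below the threshold means a single flat iteration does not end the search, while two in a row do.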

Before each iteration, the controller writes a structured JSON review prompt — current config, last 3 iteration summaries, current metrics, and five specific questions about architecture stability, depth, width, overfitting, and whether newsflow features should be activated. This prompt is designed to be sent to an LLM for a second-opinion review, constrained to recommend only changes within the allowed search space. The recommendation and the controller's resulting action are both written to PostgreSQL and the iteration artifact directory.
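
One way to structure the review prompt and to enforce the search-space constraint on whatever comes back is sketched below. The field names, question wording, and helper names are hypothetical; only the bounds, the three-iteration history, and the five question topics come from the description above.

```python
import json

# Bounds taken from the controller's documented search space.
SEARCH_SPACE = {
    "num_layers": (6, 10),
    "d_model": (256, 384),
    "dropout": (0.10, 0.20),
}


def build_review_prompt(config, history, metrics):
    """Serialize the review request; only the last 3 iterations are included."""
    prompt = {
        "current_config": config,
        "recent_iterations": history[-3:],
        "current_metrics": metrics,
        "questions": [
            "Is the architecture stable across recent iterations?",
            "Should depth change?",
            "Should width change?",
            "Is there evidence of overfitting?",
            "Should newsflow features be activated?",
        ],
        "allowed_search_space": {k: list(v) for k, v in SEARCH_SPACE.items()},
    }
    return json.dumps(prompt, indent=2)


def within_search_space(recommendation):
    """Reject any recommended value outside the bounds or on an unknown key."""
    for key, value in recommendation.items():
        if key not in SEARCH_SPACE:
            return False
        low, high = SEARCH_SPACE[key]
        if not low <= value <= high:
            return False
    return True
```

Validating the recommendation in code, rather than trusting the prompt's instructions alone, keeps the second opinion advisory: an out-of-bounds suggestion can be logged and discarded.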

Per-Iteration Artifact Logging

Every iteration produces a complete artifact directory:

config.json — exact configuration used
metrics.json — close/high/low MAE, directional accuracy, range violation rate, training time, GPU memory, epoch time
training_curve.png — train vs validation loss across epochs
attention_summary.md — top-10 attended positions from the final encoder layer on the first validation batch
sample_predictions.csv — actual vs predicted log returns for the full validation set
model.pt — state dict + config + metrics snapshot
iteration_review_prompt.json — structured LLM review prompt for that iteration

The attention summary is deliberately lightweight: aggregate attention weights averaged across heads and batch, report the top-10 sequence positions the final layer's last token attended to. Not full mechanistic interpretability — enough to see whether the model attends to recent history or older context, and whether that changes across iterations.
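
The aggregation described above amounts to a mean over batch and heads followed by a top-k. The sketch below uses plain nested lists to stay dependency-free; the function name and input layout are assumptions, and tensor operations would be the natural choice in the training code itself.

```python
def top_attended_positions(last_token_attention, k=10):
    """Rank sequence positions by mean attention from the last token.

    last_token_attention[b][h][pos] is the weight that sample b, head h
    places on position pos when the query is the final token.
    Returns up to k (position, mean_weight) pairs, highest weight first.
    """
    seq_len = len(last_token_attention[0][0])
    totals = [0.0] * seq_len
    rows = 0
    for sample in last_token_attention:
        for head in sample:
            rows += 1
            for pos, weight in enumerate(head):
                totals[pos] += weight
    means = [total / rows for total in totals]
    ranked = sorted(enumerate(means), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

With a 252-day window, positions near 251 correspond to recent history and positions near 0 to context from almost a year earlier, so the ranked list answers the recent-versus-older question directly.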

What the Current Version Can and Cannot Be Trusted to Conclude

This section was written before any training results existed. That timing is intentional.

What the scaffold proves: CUDA training works, the bounded controller loop executes and terminates correctly, metrics and artifacts are logged per iteration, the range consistency penalty functions, and the model produces geometrically consistent range predictions.

What it cannot yet be trusted to conclude: whether the model has real edge over naive baselines. Three hardening steps are documented in NEXT_STEPS.md before a valid research conclusion can be drawn:

Purged time split — the current adjacent fractional split creates overlapping 252-day windows where train and validation share nearly identical recent context. A proper purged/embargoed split removes this contamination.
Explicit naive baselines — zero-return, persistence, and ATR-style range predictors need to be established before the model's numbers mean anything.
Horizon-by-horizon evaluation — determine which forecast horizons actually benefit from the transformer and which are just following recent price.
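
A purged split of the kind the first item calls for could look like the sketch below. It assumes samples are identified by the integer trading-day index on which their 252-day window ends, and it chooses the strictest gap (no shared day at all across the boundary); the function name and `val_fraction` default are hypothetical.

```python
def purged_split(end_days, val_fraction=0.2, seq_len=252, max_horizon=20):
    """Split sample end-days into train and validation with a purge gap.

    A sample ending on day t uses features from days [t - seq_len + 1, t]
    and labels up to day t + max_horizon. A training sample is kept only if
    its label span ends before the first validation sample's feature window
    begins, so no trading day appears on both sides of the boundary.
    """
    ordered = sorted(end_days)
    val_start = ordered[int(len(ordered) * (1 - val_fraction))]
    first_val_feature_day = val_start - seq_len + 1
    train = [t for t in ordered if t + max_horizon < first_val_feature_day]
    val = [t for t in ordered if t >= val_start]
    return train, val


# Example: about seven years of daily samples for one calendar.
train_days, val_days = purged_split(list(range(251, 2000)))
```

The cost is visible in the example: roughly seq_len + max_horizon days of samples just before the boundary are discarded from training. Splitting on calendar days rather than per-ticker row counts applies the same boundary to every ticker in the global model.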

Writing the critique before results exist removes the temptation to rationalize around weaknesses afterward. The hardening list is not aspirational — it is a condition on the validity of any conclusion the system produces.

What I Built — Leadership and Judgment

Drew a clear line between bounded search and autonomous agent. The controller is not an agent. It follows a priority-ordered decision tree, operates within explicit parameter bounds, stops on a metric-based criterion, and explains every change it makes with a single logged rule. That constraint is deliberate: unbounded autonomous self-modification is harder to debug, harder to trust, and harder to explain to a skeptical reviewer. The bounded design produces a slower-moving but auditable experiment history. Auditability is a feature.

Documented the known weaknesses before results existed. NEXT_STEPS.md was written immediately after the scaffold worked — before any training numbers were in hand. This prevents the failure mode where a search produces numbers that look reasonable, and the overlapping validation windows are only discovered after the results are already being discussed. Writing the critique first removes the rationalization window.

Sequenced the sentiment integration deliberately. Newsflow features are off by default. The config comment is explicit: activate only after the OHLCV baseline is stable, because otherwise the search space becomes too large too early and the controller cannot distinguish architecture improvements from feature improvements. The same sequencing discipline that structured the Sentiment Engine's four-phase build is applied here at the architecture layer.

Results

Dimension	Detail
Model type	Global multi-ticker transformer (one model, all tickers)
Sequence length	252 trading days (one full year of context)
Prediction targets	9 outputs: close/high/low log returns at 1d, 5d, 20d
Base architecture	6 layers, d_model=256, 8 heads, feedforward_dim=1024
Controller search space	Layers: 6→10, d_model: 256→384, dropout: 0.10→0.20, LR decay
Max iterations	8 (stops earlier on stabilization criterion)
Artifacts per iteration	config, metrics, attention summary, predictions, training curve, checkpoint, LLM review prompt
DB tables	autotransformer_run_log, autotransformer_iteration_log
Current status	Scaffold operational — CUDA training, controller loop, logging all verified
Next hardening step	Purged time split, naive baselines, horizon-by-horizon evaluation
Dashboard	Dev Lab tab — run summary, iteration comparison, written findings

What This Shows About How I Work

I make architecture decisions based on the specific problem, not on what is fashionable. Using a Transformer here and an LSTM in LDTM is not inconsistency — it is judgment applied to two different sequence lengths, two different cross-section structures, and two different questions. I build autonomous systems with explicit limits rather than open-ended agents, because a system that can explain every decision it makes is more useful than one that produces better numbers you can't interrogate. And I document what the current version cannot yet be trusted to do before results are in hand, because that is the only time the critique is honest.

Technologies

Python 3.11 · PyTorch (nn.Transformer, nn.MultiheadAttention, nn.Embedding, AdamW, FP16 AMP, GradScaler) · NVIDIA GB10 Blackwell · CUDA · NGC PyTorch container · Custom Huber loss + range consistency penalty · pandas · NumPy · SQLAlchemy · PostgreSQL 15 · Matplotlib · Streamlit Dev Lab · Sentiment Engine newsflow_features (optional exogenous input) · GDELT macro tone (optional) · Docker · NVIDIA Container Toolkit

Relevant For

Role	Why This Story Fits
Staff / Principal ML Engineer	Global vs per-ticker architecture tradeoffs, geometric constraint loss design, bounded autonomous search with explicit stop criteria
AI / ML Infrastructure Engineer	Transformer on GPU with full AMP, per-iteration artifact management, structured LLM integration as bounded experiment guidance
Head of AI / Research Lead	Self-improving system with guardrails — bounded search, stop criteria, honest pre-emptive critique of current evaluation weaknesses
Quant / Algo Research	Multi-horizon range prediction (high/low/close), log return targets, geometric consistency in loss function
Technical Founding Role	End-to-end ownership: architecture, loss design, training loop, controller logic, artifact logging, database, dashboard — all from scratch