AutoTransformer — Kiro-Style Design Document

Date: 2026-05-08

Project: autotransformer

Target Runtime: NVIDIA CUDA via Docker on DGX / GB10

Status: Design approved for first implementation slice

1. Context

We want a local-first autonomous transformer project that predicts future stock ranges using:

roughly 20 years of daily market data
around 100 stocks
OHLCV as the core signal
Finnhub, Interactive Brokers, and GDELT as optional context features

The model should:

start with a 6-layer transformer and add or remove layers as needed
train on 80% of the data and reserve the remaining 20% for validation and backtesting
train on NVIDIA GPU
log internal states in a human-readable way
evaluate itself after each training cycle
ask an LLM or coding agent to review metrics and suggest the next adjustment
continue iterating until performance stabilizes
automatically adjust hyperparameters based on performance metrics
log the time taken per layer and per epoch for dashboarding
use any technical-indicator data already present, without overfitting or introducing bias

This project should remain understandable to a typical programmer. Readability is a requirement, not a nice-to-have.
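The 80% training split listed above is easiest to keep leak-free when it is chronological rather than random. The following is a minimal sketch under stated assumptions: the remaining 20% is divided evenly between validation and holdout, and the function name and the 10/10 division are illustrative, not part of the design.

```python
from datetime import date, timedelta


def chronological_split(dates, train_frac=0.8, val_frac=0.1):
    """Split unique dates into train / validation / holdout in time order.

    Only the 80% training share comes from the design; the 10/10 split of
    the remainder is an assumption made for this sketch.
    """
    ordered = sorted(set(dates))
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical calendar of 100 consecutive days, for illustration only.
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(100)]
train, val, holdout = chronological_split(days)
print(len(train), len(val), len(holdout))
```

Splitting by date for all tickers at once keeps every validation and holdout day strictly after every training day. Samples whose forward target window crosses a split boundary would likely need to be dropped as well.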

2. Problem Statement

Predict future high, low, and close ranges for multiple forward horizons from daily historical data.

The immediate design goal is not "perfect autonomy." The immediate design goal is:

a stable training loop
clear experiment logs
architecture search with guardrails
visible internal behavior
reproducible final model selection

3. Objectives

Primary Objectives

Build a transformer-based forecasting system for future price ranges.
Support multi-horizon outputs such as:
- next day
- next 5 trading days
- next 20 trading days
Run the full training and evaluation cycle on CUDA-enabled NVIDIA hardware.
Log model behavior, training metrics, attention summaries, and sample predictions per iteration.
Add a bounded architecture controller that can adjust depth and selected hyperparameters after each cycle.

Secondary Objectives

Incorporate sentiment and macro/news context from:
- Finnhub
- Interactive Brokers
- GDELT
Expose a human-readable experiment history that explains:
- what changed
- why it changed
- whether it helped

Non-Goals for Version 1

Full AutoML across every possible architecture
Tick-level or intraday modeling
Reinforcement learning
Unbounded autonomous self-modification
Production live-trading execution

4. Key Design Decision

Use a bounded autonomous search loop, not an unbounded self-tuning system.

That means:

start from a known baseline
allow only a controlled set of changes per iteration
require metric-based justification for each change
stop after a small number of non-improving rounds

This is safer, easier to debug, and much more readable than a free-form agent loop.

5. Proposed Data Inputs

Core Inputs

Per ticker, per day:

open
high
low
close
volume

Derived Market Features

Recommended first-pass engineered features:

daily return
log return
intraday range
close-to-high distance
close-to-low distance
rolling volatility
rolling volume z-score
moving average ratios
ATR-style range features
gap features
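Several of the features above can be sketched in plain Python. This is an illustration only: the dictionary keys, the choice to scale range features by the close, and the population standard deviation in the z-score are assumptions rather than decisions made in this design.

```python
import math


def basic_features(bars):
    """Compute a few per-day features from a list of OHLCV dicts.

    The first bar is skipped because return and gap features need the
    previous day's close.
    """
    rows = []
    for prev, cur in zip(bars, bars[1:]):
        rows.append({
            "daily_return": cur["close"] / prev["close"] - 1.0,
            "log_return": math.log(cur["close"] / prev["close"]),
            "intraday_range": (cur["high"] - cur["low"]) / cur["close"],
            "close_to_high": (cur["high"] - cur["close"]) / cur["close"],
            "close_to_low": (cur["close"] - cur["low"]) / cur["close"],
            "gap": cur["open"] / prev["close"] - 1.0,
        })
    return rows


def rolling_zscore(values, window):
    """Z-score of each value against the trailing window that ends on it."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
            continue
        chunk = values[i + 1 - window : i + 1]
        mean = sum(chunk) / window
        std = math.sqrt(sum((v - mean) ** 2 for v in chunk) / window)
        out.append((values[i] - mean) / std if std > 0 else 0.0)
    return out


# Two made-up bars, for illustration only.
bars = [
    {"open": 100.0, "high": 102.0, "low": 99.0, "close": 100.0, "volume": 1000},
    {"open": 101.0, "high": 104.0, "low": 100.0, "close": 102.0, "volume": 1500},
]
features = basic_features(bars)
volume_z = rolling_zscore([1.0, 2.0, 3.0], window=3)
```

Every feature uses only the current and earlier bars, which keeps future information out of the inputs.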

Cross-Sectional / Context Features

Use a global model, not 100 separate models, for the first implementation.

Add:

ticker embedding
sector or group embedding if available
market-wide context features from indices or ETF proxies later if needed

News / Sentiment Features

Use news context as optional exogenous inputs, not as the first dependency.

Recommended order:

OHLCV-only baseline
Add newsflow_features from the Sentiment Engine
Add GDELT macro tone features separately

Suggested sentiment fields:

24h score_flow
72h score_flow
article count
positive ratio
negative ratio
source disagreement

Suggested GDELT fields:

rolling mean tone
rolling tone z-score
event volume
macro stress proxy counts

6. Prediction Targets

Predict future range values over multiple horizons.

Recommended target formulation:

close_return_h1
future_high_return_h1
future_low_return_h1
close_return_h5
future_high_return_h5
future_low_return_h5
close_return_h20
future_high_return_h20
future_low_return_h20

Why returns instead of raw prices:

easier normalization
better cross-ticker comparability
more stable optimization

At inference time, convert predicted returns back to price ranges using the latest close.
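A minimal sketch of the target construction and the conversion back to prices follows. It assumes that "future high" and "future low" mean the maximum high and minimum low over the next h trading days, and that the close return uses the close on day t+h; the function names are hypothetical.

```python
def forward_targets(closes, highs, lows, t, horizon):
    """Return-based targets for day t over the next `horizon` trading days."""
    if t + horizon >= len(closes):
        return None  # not enough future data; skip this sample
    base = closes[t]
    window = slice(t + 1, t + horizon + 1)
    return {
        f"close_return_h{horizon}": closes[t + horizon] / base - 1.0,
        f"future_high_return_h{horizon}": max(highs[window]) / base - 1.0,
        f"future_low_return_h{horizon}": min(lows[window]) / base - 1.0,
    }


def returns_to_prices(latest_close, predicted_returns):
    """Convert predicted returns back to price levels using the latest close."""
    return {name: latest_close * (1.0 + r) for name, r in predicted_returns.items()}


# Made-up prices, for illustration only.
closes = [100.0, 102.0, 101.0, 105.0]
highs = [101.0, 103.0, 104.0, 106.0]
lows = [99.0, 100.0, 100.0, 102.0]
targets = forward_targets(closes, highs, lows, t=0, horizon=2)
prices = returns_to_prices(closes[0], targets)
```

Returning `None` near the end of the series makes it explicit that the last h days of each ticker cannot produce a training sample for horizon h.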

7. Baseline Model Architecture

Initial Architecture

Start with:

6 transformer encoder layers
model dimension around 256
8 attention heads
feed-forward dimension around 1024
dropout around 0.1
sequence length around 252 trading days
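The baseline values above could live in one small, validated config object. The sketch below uses hypothetical field names and estimates the encoder weight count for a standard encoder layer (four attention projections, a two-layer feed-forward block, two LayerNorms), excluding embeddings and output heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Baseline hyperparameters from the design; field names are assumptions."""
    num_layers: int = 6
    d_model: int = 256
    num_heads: int = 8
    ffn_dim: int = 1024
    dropout: float = 0.1
    lookback_days: int = 252

    def validate(self):
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    def encoder_param_count(self):
        """Weights in the encoder stack, excluding embeddings and heads."""
        d, f = self.d_model, self.ffn_dim
        attention = 4 * (d * d + d)               # Q, K, V, output projections with biases
        feed_forward = (d * f + f) + (f * d + d)  # two linear layers with biases
        layer_norms = 2 * (2 * d)                 # two LayerNorms, scale and shift each
        return self.num_layers * (attention + feed_forward + layer_norms)


cfg = ModelConfig()
cfg.validate()
print(cfg.encoder_param_count())  # 4738560
```

At roughly 4.7 million encoder weights, the baseline is small by transformer standards, which is consistent with the single-GPU plan.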

Input Layout

Use a global multi-ticker model:

one training sample = one ticker window
static ticker embedding
dynamic daily features across the lookback window

Output Head

Use multi-output regression with separate output heads (not to be confused with attention heads):

one head per horizon
each head predicts:
- future close return
- future high return
- future low return

Loss

Recommended V1 loss:

weighted Huber or MSE for returns

Recommended V2 improvement:

add range-consistency penalties so:
- predicted low does not exceed predicted close
- predicted high does not fall below predicted close
- predicted high remains above predicted low
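The V1 loss and the V2 consistency penalties can be sketched per sample in plain Python. The hinge form of the penalty, the `delta` value, and `penalty_weight` are assumptions for illustration; a real implementation would operate on batched tensors and could add per-target weights.

```python
def huber(error, delta=1.0):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def range_consistency_penalty(pred_close, pred_high, pred_low):
    """Hinge-style penalty that is zero when low <= close <= high."""
    return (
        max(0.0, pred_low - pred_close)      # predicted low exceeds predicted close
        + max(0.0, pred_close - pred_high)   # predicted high falls below predicted close
        + max(0.0, pred_low - pred_high)     # predicted high falls below predicted low
    )


def sample_loss(pred, target, delta=0.01, penalty_weight=0.1):
    """Huber on (close, high, low) returns plus the weighted range penalty.

    delta and penalty_weight are placeholder values, not tuned settings.
    """
    fit = sum(huber(p - t, delta) for p, t in zip(pred, target))
    return fit + penalty_weight * range_consistency_penalty(*pred)


consistent = range_consistency_penalty(0.01, 0.03, -0.02)
inverted = range_consistency_penalty(0.0, -0.01, 0.02)
```

A consistent prediction incurs no penalty, so the extra term only affects samples where the predicted ordering is violated.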

8. Why a Transformer Here

A transformer is attractive because it can:

look across long historical windows
model temporal relationships without relying on recurrence
produce attention summaries that are easier to inspect than many alternatives
scale cleanly with global multi-ticker training

That said, the design should remain humble:

a plain transformer is not automatically better than a Temporal Fusion Transformer (TFT) or an LSTM
the benchmark must prove the value

9. Autonomous Iteration Loop

Principle

The system should not directly rewrite its own source code freely.

Instead, it should run a controller loop like this:

Train current configuration
Evaluate on validation and holdout sets
Log metrics and interpretable artifacts
Produce a structured summary for an LLM or coding agent
Ask for a constrained next-step recommendation
Apply only allowed changes
Repeat until stop criteria are met

Allowed Hyperparameter Changes Per Iteration

Recommended initial search space:

number of layers: 4, 6, 8, 10
model dimension: 128, 256, 384
attention heads: 4, 8
dropout: 0.05, 0.10, 0.15, 0.20
learning rate
batch size
lookback window
sentiment feature on/off
GDELT feature on/off

Suggested LLM Prompt Contract

The controller should send structured context, not raw logs.

Example sections:

current config
previous 3 iteration summaries
validation metrics
overfitting indicators
calibration notes
attention summary
example predictions
GPU throughput and runtime

And ask:

Should we keep the architecture?
Should we increase or decrease layers?
Should we widen or narrow the model?
Should we turn on sentiment features yet?
Are we overfitting?
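The prompt contract above could be assembled as a single JSON payload. This is a sketch only: the key names, the train/validation gap as an overfitting indicator, and the requested response format are assumptions, not a fixed schema.

```python
import json

REVIEW_QUESTIONS = [
    "Should we keep the architecture?",
    "Should we increase or decrease layers?",
    "Should we widen or narrow the model?",
    "Should we turn on sentiment features yet?",
    "Are we overfitting?",
]


def build_review_payload(config, history, metrics, notes):
    """Assemble the structured context sent to the reviewing LLM or agent."""
    payload = {
        "current_config": config,
        "previous_iterations": history[-3:],  # only the last 3 summaries
        "validation_metrics": metrics,
        "overfitting_indicators": {
            "train_val_gap": metrics["val_loss"] - metrics["train_loss"],
        },
        "notes": notes,
        "questions": REVIEW_QUESTIONS,
        "response_format": {
            "action": "keep | change",
            "changes": "object mapping parameter name to new value",
            "reason": "one or two sentences citing the metrics above",
        },
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Made-up values, for illustration only.
history = [{"iteration": i, "val_loss": 0.02 - 0.001 * i} for i in range(5)]
payload_text = build_review_payload(
    config={"num_layers": 6, "d_model": 256},
    history=history,
    metrics={"train_loss": 0.009, "val_loss": 0.012},
    notes={"attention_summary": "placeholder", "gpu_throughput": "placeholder"},
)
payload = json.loads(payload_text)
```

Asking for a structured response makes it straightforward to validate the recommendation against the allowed search space before anything is applied.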

Stop Criteria

Define "optimal" as stabilized rather than globally optimal.

Recommended stopping rules:

maximum 8 to 15 full iterations
stop after 2 or 3 consecutive non-improving rounds
stop when improvement is smaller than a threshold such as:
- delta_val_loss < 0.25%
- delta_range_coverage < 0.2%
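The stopping rules can be combined into one check. The sketch below assumes that the 0.25% threshold is a relative improvement over the best validation loss seen so far, and it picks 12 iterations and a patience of 3 from within the suggested ranges; the function name is hypothetical.

```python
def should_stop(val_losses, max_iterations=12, patience=3, min_rel_improvement=0.0025):
    """Decide whether the controller loop should stop.

    val_losses holds one validation loss per completed iteration (lower is
    better). A round counts as non-improving when it fails to beat the best
    earlier loss by at least min_rel_improvement (0.25%).
    """
    if len(val_losses) >= max_iterations:
        return True, "max iterations reached"
    stale = 0
    best = float("inf")
    for loss in val_losses:
        if loss < best * (1.0 - min_rel_improvement):
            best = loss
            stale = 0
        else:
            stale += 1
    if stale >= patience:
        return True, f"{stale} consecutive non-improving rounds"
    return False, "continue"


improving = should_stop([1.0, 0.9, 0.8])
plateau = should_stop([1.0, 0.9, 0.899, 0.8995, 0.8999])
```

The same pattern could be extended with a second series for range coverage if both thresholds should be enforced.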

10. Human-Readable Internal State Logging

Important note:

These are not the model's literal "thoughts."

They are interpretable diagnostic views of internal behavior.

What to Log Each Iteration

configuration used
train / validation / holdout metrics
learning curves
attention maps for selected samples
embedding projection snapshots
sample predictions vs actual ranges
worst-case examples
best-case examples
calibration / uncertainty summaries
GPU memory, throughput, and epoch time
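Epoch and stage timing can be collected with a small context manager. This sketch measures wall-clock time only and uses hypothetical names. On a GPU, kernels are launched asynchronously, so the device would need to be synchronized before each clock read for the numbers to be meaningful.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    """Record wall-clock seconds for the enclosed block under sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.setdefault(label, []).append(time.perf_counter() - start)


timings = {}
for epoch in range(2):
    with timed("epoch", timings):
        sum(i * i for i in range(10_000))  # stand-in for one training epoch

print({label: [round(s, 4) for s in secs] for label, secs in timings.items()})
```

Storing a list per label keeps per-epoch values available for the dashboard instead of only a running total.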

Human-Readable Artifacts

Recommended artifacts per iteration:

config.json
metrics.json
attention_summary.md
sample_predictions.csv
layer_probe_summary.md
training_curve.png
attention_heatmap_<sample>.png
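A minimal sketch of writing the text-based artifacts for one iteration follows. The `iter_001` directory naming and the function signature are assumptions; the markdown and image artifacts are omitted here.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_iteration_artifacts(root, iteration, config, metrics, predictions):
    """Write config.json, metrics.json and sample_predictions.csv for one iteration."""
    out_dir = Path(root) / f"iter_{iteration:03d}"
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
    with open(out_dir / "sample_predictions.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(predictions[0].keys()))
        writer.writeheader()
        writer.writerows(predictions)
    return out_dir


# Made-up values written to a temporary directory, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    out = write_iteration_artifacts(
        tmp,
        iteration=1,
        config={"num_layers": 6, "d_model": 256},
        metrics={"val_loss": 0.012},
        predictions=[{"ticker": "XYZ", "pred_close_return_h1": 0.004}],
    )
    written = sorted(p.name for p in out.iterdir())
    saved_config = json.loads((out / "config.json").read_text())
```

Sorted keys and fixed indentation keep the JSON files diff-friendly across iterations.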

Layer Probe Guidance

Keep this simple at first:

capture the first embedding projection
capture attention focus from the final encoder layer
capture 3 to 5 example hidden-state summaries
capture 5 sample predictions before and after training

Avoid building a giant interpretability platform in V1.

11. Metrics

Primary Metrics

MAE on future close return
MAE on future high return
MAE on future low return
RMSE on future close return
range coverage accuracy
directional accuracy for close return
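The primary metrics can be sketched in a few lines. The design does not define range coverage precisely, so this example assumes it is the share of realized close returns that fall inside the predicted [low, high] band; directional accuracy is assumed to compare the signs of predicted and realized close returns.

```python
import math


def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


def rmse(pred, actual):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def directional_accuracy(pred, actual):
    """Share of samples where predicted and realized returns have the same sign."""
    hits = sum(1 for p, a in zip(pred, actual) if (p > 0) == (a > 0))
    return hits / len(actual)


def range_coverage(pred_low, pred_high, actual_close):
    """Share of realized close returns inside the predicted [low, high] band."""
    inside = sum(
        1 for lo, hi, c in zip(pred_low, pred_high, actual_close) if lo <= c <= hi
    )
    return inside / len(actual_close)


# Made-up returns, for illustration only.
close_mae = mae([0.01, -0.02], [0.02, -0.01])
direction = directional_accuracy([0.01, -0.02, 0.03], [0.02, 0.01, 0.01])
coverage = range_coverage([-0.01, -0.01], [0.01, 0.01], [0.0, 0.02])
```

Computing each metric per horizon and per ticker with the same functions would also cover the breakdowns listed under the secondary metrics.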

Useful Secondary Metrics

calibration of predicted ranges
high-low spread error
correlation between predicted and realized returns
horizon-wise error breakdown
per-ticker error distribution

Selection Metric

Use a weighted composite selection score, for example:

selection_score =
  0.35 * normalized_close_mae +
  0.25 * normalized_high_mae +
  0.25 * normalized_low_mae +
  0.10 * normalized_range_coverage_error +
  0.05 * runtime_penalty

Lower is better. Because runtime carries an explicit weight, the controller is discouraged from choosing a giant slow model for a tiny accuracy gain.
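The composite score can be computed as a plain weighted sum where lower is better. The design does not say how each component is normalized, so this sketch assumes a ratio to the baseline run (1.0 means "same as baseline"); the helper names are hypothetical.

```python
WEIGHTS = {
    "normalized_close_mae": 0.35,
    "normalized_high_mae": 0.25,
    "normalized_low_mae": 0.25,
    "normalized_range_coverage_error": 0.10,
    "runtime_penalty": 0.05,
}


def normalize_to_baseline(value, baseline):
    """Assumed normalization: ratio to the baseline run's value."""
    return value / baseline


def selection_score(components):
    """Weighted sum of normalized components; lower is better."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


baseline = {name: 1.0 for name in WEIGHTS}
# Hypothetical candidate: 2% better close MAE, but twice the runtime (8h vs 4h).
candidate = dict(baseline)
candidate["normalized_close_mae"] = 0.98
candidate["runtime_penalty"] = normalize_to_baseline(8.0, 4.0)

baseline_score = selection_score(baseline)
candidate_score = selection_score(candidate)
```

In this made-up case the candidate scores about 1.043 against 1.0 for the baseline, so doubling the runtime for a 2% close-MAE gain would not be selected.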

12. Proposed Project Structure

Recommended folder shape:

autotransformer/
  README.md
  KIRO_DESIGN.md
  Dockerfile
  requirements.txt
  config.py
  dataset.py
  features.py
  model.py
  losses.py
  trainer.py
  evaluate.py
  controller.py
  prompts/
    iteration_review.txt
  reports/
  artifacts/
  sql/
    schema.sql

For this first step, only the design package is required. The code scaffold can come next.

13. Runtime and CUDA Design

Container Base

Use:

nvcr.io/nvidia/pytorch:25.01-py3

Reason:

matches the existing DGX GPU workflow in this repository
keeps CUDA and PyTorch aligned
avoids host Python pollution

GPU Strategy

Version 1 should use:

single-GPU training on DGX / GB10
mixed precision if stable
gradient accumulation only if needed

This project does not need multi-GPU complexity on day one.

14. Logging and Persistence

Database Tables to Consider

Recommended tables:

autotransformer_run_log
autotransformer_iteration_log
autotransformer_model_registry
autotransformer_prediction_snapshots
autotransformer_attention_artifacts

Minimum Per-Iteration DB Fields

run_id
iteration_id
created_at
model_depth
model_dim
num_heads
dropout
learning_rate
batch_size
lookback_days
sentiment_enabled
gdelt_enabled
train_loss
val_loss
close_mae
high_mae
low_mae
range_coverage
directional_accuracy
training_minutes
gpu_memory_gb
llm_recommendation
controller_action
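The per-iteration fields above map directly onto one table. The sketch below uses in-memory SQLite only so that it runs anywhere; the design targets Postgres, where the column types would differ (for example a timestamp type for `created_at`). The column types, the composite primary key, and all row values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE autotransformer_iteration_log (
        run_id TEXT NOT NULL,
        iteration_id INTEGER NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        model_depth INTEGER NOT NULL,
        model_dim INTEGER NOT NULL,
        num_heads INTEGER NOT NULL,
        dropout REAL NOT NULL,
        learning_rate REAL NOT NULL,
        batch_size INTEGER NOT NULL,
        lookback_days INTEGER NOT NULL,
        sentiment_enabled INTEGER NOT NULL,
        gdelt_enabled INTEGER NOT NULL,
        train_loss REAL,
        val_loss REAL,
        close_mae REAL,
        high_mae REAL,
        low_mae REAL,
        range_coverage REAL,
        directional_accuracy REAL,
        training_minutes REAL,
        gpu_memory_gb REAL,
        llm_recommendation TEXT,
        controller_action TEXT,
        PRIMARY KEY (run_id, iteration_id)
    )
""")

# Placeholder values for illustration; created_at is filled by the default.
row = {
    "run_id": "demo-run", "iteration_id": 1, "model_depth": 6, "model_dim": 256,
    "num_heads": 8, "dropout": 0.1, "learning_rate": 1e-4, "batch_size": 64,
    "lookback_days": 252, "sentiment_enabled": 0, "gdelt_enabled": 0,
    "train_loss": 0.0101, "val_loss": 0.0123, "close_mae": 0.011,
    "high_mae": 0.012, "low_mae": 0.012, "range_coverage": 0.7,
    "directional_accuracy": 0.52, "training_minutes": 90.0, "gpu_memory_gb": 12.5,
    "llm_recommendation": "keep architecture", "controller_action": "keep",
}
columns = ", ".join(row)
placeholders = ", ".join(f":{name}" for name in row)
conn.execute(
    f"INSERT INTO autotransformer_iteration_log ({columns}) VALUES ({placeholders})",
    row,
)
stored = conn.execute(
    "SELECT model_depth, val_loss, controller_action "
    "FROM autotransformer_iteration_log WHERE run_id = ?",
    ("demo-run",),
).fetchone()
```

Keying the table on `(run_id, iteration_id)` makes each controller step individually traceable and prevents the same iteration from being logged twice.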

15. Risks and Caveats

Real Risk 1: "Optimal" Is Not Absolute

This project can converge to a strong local configuration. It cannot guarantee a universal optimum.

Real Risk 2: LLM Recommendations Can Be Noisy

The LLM should only advise within a bounded search space, and the controller should apply only the changes that space allows. The LLM should not be treated as a mathematically reliable optimizer.

Real Risk 3: News Features Can Hurt Before They Help

Sentiment and GDELT inputs should be added after the OHLCV baseline works. Otherwise the search space becomes too large too early.

Real Risk 4: Interpretability Can Balloon Scope

Readable layer probes are useful. A full mechanistic interpretability system is not needed for V1.

16. Recommended Build Phases

Phase A — Foundation

build dataset pipeline
define targets
build baseline transformer
train a single fixed 6-layer model

Phase B — Benchmarking

compare 4 / 6 / 8 layers
compare 128 / 256 / 384 dimensions
log throughput and metrics

Phase C — Autonomous Controller

add structured iteration summaries
add LLM review prompt
constrain controller actions
stop on stabilization criteria

Phase D — News Context

add newsflow_features
add GDELT aggregates
re-run bounded search

Phase E — Dashboard and Reporting

model catalog page
experiment history
architecture evolution view
attention and sample prediction gallery

17. Approximate Delivery Timeline

These are realistic working estimates, not guarantees.

Design and Scaffold

0.5 to 1.5 days

First Trainable Baseline

2 to 4 days

First Full Controlled Experiment Loop

3 to 7 days

Adding Newsflow + GDELT Features

2 to 5 days

Usable Dashboard / Reporting Layer

1 to 3 days

Total to First Stabilized Version

roughly 1 to 3 weeks of focused work

18. Approximate Training Runtime on DGX / GB10

Actual runtime depends on:

lookback length
number of features
batch size
number of horizons
whether sentiment features are enabled
how many experiments are allowed

Reasonable estimate for V1:

one baseline training cycle: 1 to 4 hours
one larger comparison round: 4 to 12 hours
one bounded search run of 8 to 12 iterations: 1 to 4 days wall clock

If early stopping and pruning are implemented well, the lower end is realistic.

19. Definition of "Good Enough"

For this project, good enough means:

reproducible train / validation / holdout split
readable experiment history
stable bounded controller loop
no major metric improvement over several rounds
final chosen architecture recorded with evidence
sample predictions and attention summaries visible

That is the practical definition of "optimal state" for V1.

20. Recommended First Implementation Slice

If we keep this small and smart, the first implementation should be:

Global OHLCV-only transformer
6 layers
252-day lookback
horizons 1, 5, 20
single-GPU CUDA training
structured iteration logs to Postgres + files
one bounded controller that can only change:
- layers
- model dimension
- dropout
- learning rate

Do not start with:

full sentiment integration
GDELT event taxonomy
multi-GPU training
open-ended agent refactors

21. Final Recommendation

Yes, this project is feasible on your NVIDIA setup.

The best way to do it is:

start with a narrow, readable baseline
use a bounded controller rather than unrestricted autonomy
treat LLM feedback as experiment guidance, not as truth
define stabilization instead of pretending to reach a universal optimum
add sentiment and GDELT only after the OHLCV baseline is stable

The estimated path to a strong first stabilized version is:

roughly 1 to 3 weeks

The estimated path to a richer mature version with news context, reporting, and repeated controller runs is:

roughly 3 to 6 weeks

22. Acceptance Criteria for Version 1

autotransformer/ design package exists
baseline transformer can train on CUDA
iteration logs are written every cycle
LLM/controller recommendations are stored
architecture changes are bounded and traceable
final selected architecture is reported
final evaluation metrics are exported
internal states are summarized in a human-readable way