AutoTransformer — Kiro-Style Design Document

Kiro-style design for the AutoTransformer project: automated architecture search over temporal Transformer variants for financial time series.

October 1, 2025

Date: 2026-05-08
Project: autotransformer
Target Runtime: NVIDIA CUDA via Docker on DGX / GB10
Status: Design approved for first implementation slice

1. Context

We want a local-first autonomous transformer project that predicts future stock price ranges using:

  • roughly 20 years of daily market data
  • around 100 stocks
  • OHLCV as the core signal
  • Finnhub, Interactive Brokers, and GDELT as optional context features

The model should:

  • start with a 6-layer transformer and add or remove layers as needed
  • train on 80% of the data and hold out the remaining 20% for validation and backtesting, as applicable
  • train on NVIDIA GPU
  • log internal states in a human-readable way
  • evaluate itself after each training cycle
  • ask an LLM or coding agent to review metrics and suggest the next adjustment
  • continue iterating until performance stabilizes
  • automatically adjust hyperparameters based on performance metrics
  • log the time taken per layer and per epoch for dashboarding
  • use any technical-indicator data already present, without overfitting or introducing bias

This project should remain understandable to a typical programmer. Readability is a requirement, not a nice-to-have.

2. Problem Statement

Predict future high, low, and close ranges for multiple forward horizons from daily historical data.

The immediate design goal is not "perfect autonomy." The immediate design goal is:

  • a stable training loop
  • clear experiment logs
  • architecture search with guardrails
  • visible internal behavior
  • reproducible final model selection

3. Objectives

Primary Objectives

  • Build a transformer-based forecasting system for future price ranges.
  • Support multi-horizon outputs such as:
    • next day
    • next 5 trading days
    • next 20 trading days
  • Run the full training and evaluation cycle on CUDA-enabled NVIDIA hardware.
  • Log model behavior, training metrics, attention summaries, and sample predictions per iteration.
  • Add a bounded architecture controller that can adjust depth and selected hyperparameters after each cycle.

Secondary Objectives

  • Incorporate sentiment and macro/news context from:
    • Finnhub
    • Interactive Brokers
    • GDELT
  • Expose a human-readable experiment history that explains:
    • what changed
    • why it changed
    • whether it helped

Non-Goals for Version 1

  • Full AutoML across every possible architecture
  • Tick-level or intraday modeling
  • Reinforcement learning
  • Unbounded autonomous self-modification
  • Production live-trading execution

4. Key Design Decision

Use a bounded autonomous search loop, not an unbounded self-tuning system.

That means:

  • start from a known baseline
  • allow only a controlled set of changes per iteration
  • require metric-based justification for each change
  • stop after a small number of non-improving rounds

This is safer, easier to debug, and much more readable than a free-form agent loop.

5. Proposed Data Inputs

Core Inputs

Per ticker, per day:

  • open
  • high
  • low
  • close
  • volume

Derived Market Features

Recommended first-pass engineered features:

  • daily return
  • log return
  • intraday range
  • close-to-high distance
  • close-to-low distance
  • rolling volatility
  • rolling volume z-score
  • moving average ratios
  • ATR-style range features
  • gap features

Cross-Sectional / Context Features

Use a global model, not 100 separate models, for the first implementation.

Add:

  • ticker embedding
  • sector or group embedding if available
  • market-wide context features from indices or ETF proxies (later, if needed)

News / Sentiment Features

Use news context as optional exogenous inputs, not as the first dependency.

Recommended order:

  1. OHLCV-only baseline
  2. Add newsflow_features from the Sentiment Engine
  3. Add GDELT macro tone features separately

Suggested sentiment fields:

  • 24h score_flow
  • 72h score_flow
  • article count
  • positive ratio
  • negative ratio
  • source disagreement

Suggested GDELT fields:

  • rolling mean tone
  • rolling tone z-score
  • event volume
  • macro stress proxy counts

6. Prediction Targets

Predict future range values over multiple horizons.

Recommended target formulation:

  • close_return_h1
  • future_high_return_h1
  • future_low_return_h1
  • close_return_h5
  • future_high_return_h5
  • future_low_return_h5
  • close_return_h20
  • future_high_return_h20
  • future_low_return_h20

Why returns instead of raw prices:

  • easier normalization
  • better cross-ticker comparability
  • more stable optimization

At inference time, convert predicted returns back to price ranges using the latest close.
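
The target construction and the conversion back to prices can be sketched as follows. The exact definitions are assumptions: the future high and low are taken as the extreme values over days t+1 through t+h, and every return is measured against the close on day t. The helper names `make_targets` and `returns_to_prices` are hypothetical.

```python
from typing import Dict, List, Optional


def make_targets(
    high: List[float],
    low: List[float],
    close: List[float],
    t: int,
    horizon: int,
) -> Optional[Dict[str, float]]:
    """Return the three targets for day index t, or None if the window is incomplete."""
    end = t + horizon
    if end >= len(close):
        return None  # not enough future data; skip this sample
    base = close[t]
    future_high = max(high[t + 1 : end + 1])
    future_low = min(low[t + 1 : end + 1])
    return {
        f"close_return_h{horizon}": close[end] / base - 1.0,
        f"future_high_return_h{horizon}": future_high / base - 1.0,
        f"future_low_return_h{horizon}": future_low / base - 1.0,
    }


def returns_to_prices(latest_close: float, predicted: Dict[str, float]) -> Dict[str, float]:
    """Convert predicted returns back to price levels using the latest close."""
    return {name: latest_close * (1.0 + value) for name, value in predicted.items()}
```

For example, with a latest close of 200.0, a predicted `future_high_return_h5` of 0.03 maps to a price of about 206.0.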

7. Baseline Model Architecture

Initial Architecture

Start with:

  • 6 transformer encoder layers
  • model dimension around 256
  • 8 attention heads
  • feed-forward dimension around 1024
  • dropout around 0.1
  • sequence length around 252 trading days
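
One way to pin the baseline down is a small validated config object. This is a sketch: the class name `BaselineConfig` and its field names are hypothetical, and the "around" values above are taken as exact defaults. The parameter estimate assumes a standard encoder layer with biases (self-attention, two-layer feed-forward, two layer norms) and excludes embeddings and output heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    num_layers: int = 6
    model_dim: int = 256
    num_heads: int = 8
    ff_dim: int = 1024
    dropout: float = 0.1
    lookback_days: int = 252

    def __post_init__(self) -> None:
        # Standard multi-head attention splits model_dim evenly across heads.
        if self.model_dim % self.num_heads != 0:
            raise ValueError("model_dim must be divisible by num_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    def encoder_parameter_count(self) -> int:
        """Rough trainable-parameter count for the encoder stack only."""
        d, ff = self.model_dim, self.ff_dim
        attention = 4 * d * d + 4 * d        # Q, K, V, and output projections
        feed_forward = 2 * d * ff + ff + d   # two linear layers with biases
        layer_norms = 2 * 2 * d              # scale and shift for two norms
        return self.num_layers * (attention + feed_forward + layer_norms)
```

Under these assumptions the baseline encoder stack has about 4.7 million parameters, which is small enough for single-GPU training.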

Input Layout

Use a global multi-ticker model:

  • one training sample = one ticker window
  • static ticker embedding
  • dynamic daily features across the lookback window

Output Head

Use multi-head regression (output heads, not attention heads):

  • one head per horizon
  • each head predicts:
    • future close return
    • future high return
    • future low return

Loss

Recommended V1 loss:

  • weighted Huber or MSE for returns

Recommended V2 improvement:

  • add range-consistency penalties so:
    • predicted low does not exceed predicted close
    • predicted high does not fall below predicted close
    • predicted high remains above predicted low
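
A minimal per-sample sketch of the V1 loss with the V2 penalty added. The hinge-style penalty, the function names, and the default weights are assumptions, not part of the design. The penalty is zero whenever low <= close <= high holds for the predictions.

```python
from typing import Dict, Optional


def huber(error: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear beyond delta."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * magnitude * magnitude
    return delta * (magnitude - 0.5 * delta)


def range_consistency_penalty(pred_close: float, pred_high: float, pred_low: float) -> float:
    """Sum of violations of low <= close, close <= high, and low <= high."""
    return (
        max(0.0, pred_low - pred_close)
        + max(0.0, pred_close - pred_high)
        + max(0.0, pred_low - pred_high)
    )


def sample_loss(
    pred: Dict[str, float],
    actual: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    penalty_weight: float = 1.0,
    delta: float = 1.0,
) -> float:
    """Weighted Huber over close/high/low returns plus the consistency penalty."""
    weights = weights or {"close": 1.0, "high": 1.0, "low": 1.0}
    base = sum(
        weights[key] * huber(pred[key] - actual[key], delta)
        for key in ("close", "high", "low")
    )
    penalty = range_consistency_penalty(pred["close"], pred["high"], pred["low"])
    return base + penalty_weight * penalty
```

Daily returns are usually far smaller than 1.0, so the placeholder `delta` would need to be scaled to the return distribution before the Huber loss behaves differently from MSE.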

8. Why a Transformer Here

A transformer is attractive because it can:

  • look across long historical windows
  • model temporal relationships without relying solely on recurrence
  • produce attention summaries that are easier to inspect than many alternatives
  • scale cleanly with global multi-ticker training

That said, the design should remain humble:

  • a transformer is not automatically better than a Temporal Fusion Transformer (TFT) or an LSTM
  • the benchmark must prove the value

9. Autonomous Iteration Loop

Principle

The system should not directly rewrite its own source code freely.

Instead, it should run a controller loop like this:

  1. Train current configuration
  2. Evaluate on validation and holdout sets
  3. Log metrics and interpretable artifacts
  4. Produce a structured summary for an LLM or coding agent
  5. Ask for a constrained next-step recommendation
  6. Apply only allowed changes
  7. Repeat until stop criteria are met
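
The loop above can be sketched with its collaborators injected as plain functions, which keeps the controller itself small and testable. All names here (`run_controller` and the four callables) are hypothetical placeholders for the real trainer, LLM client, change validator, and stop rule.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]
Metrics = Dict[str, float]
Record = Dict[str, object]


def run_controller(
    initial_config: Config,
    train_and_evaluate: Callable[[Config], Metrics],
    recommend: Callable[[List[Record]], Config],
    apply_allowed: Callable[[Config, Config], Config],
    should_stop: Callable[[List[Record]], bool],
    max_iterations: int = 12,
) -> List[Record]:
    """Run the bounded loop and return one record per completed iteration."""
    history: List[Record] = []
    config = dict(initial_config)
    for iteration in range(max_iterations):
        metrics = train_and_evaluate(config)      # steps 1 and 2
        history.append(                           # step 3
            {"iteration": iteration, "config": dict(config), "metrics": metrics}
        )
        if should_stop(history):                  # step 7
            break
        proposal = recommend(history)             # steps 4 and 5
        config = apply_allowed(config, proposal)  # step 6
    return history
```

The controller never edits source code. It only passes a config dictionary through a validator, so every change is visible in the returned history.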

Allowed Hyperparameter Changes Per Iteration

Recommended initial search space:

  • number of layers: 4, 6, 8, 10
  • model dimension: 128, 256, 384
  • attention heads: 4, 8
  • dropout: 0.05, 0.10, 0.15, 0.20
  • learning rate
  • batch size
  • lookback window
  • sentiment feature on/off
  • GDELT feature on/off
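
A sketch of how the controller might enforce this search space. Only the options with explicit values above are encoded. Learning rate, batch size, and lookback window are left out because the design gives no ranges for them. The key names follow the DB fields listed later in this document, and the `max_changes` budget of 2 is an assumed guardrail.

```python
from typing import Dict, List, Tuple

# Allowed values per option, taken from the list above.
SEARCH_SPACE: Dict[str, tuple] = {
    "model_depth": (4, 6, 8, 10),
    "model_dim": (128, 256, 384),
    "num_heads": (4, 8),
    "dropout": (0.05, 0.10, 0.15, 0.20),
    "sentiment_enabled": (True, False),
    "gdelt_enabled": (True, False),
}


def apply_allowed_changes(
    config: Dict[str, object],
    proposal: Dict[str, object],
    max_changes: int = 2,
) -> Tuple[Dict[str, object], List[str]]:
    """Return a new config plus the list of rejected proposals with reasons."""
    new_config = dict(config)
    rejected: List[str] = []
    applied = 0
    for key, value in proposal.items():
        allowed = SEARCH_SPACE.get(key)
        if allowed is None or value not in allowed:
            rejected.append(f"{key}={value!r}: outside search space")
        elif new_config.get(key) == value:
            continue  # no-op, does not consume the change budget
        elif applied >= max_changes:
            rejected.append(f"{key}={value!r}: change budget exhausted")
        else:
            new_config[key] = value
            applied += 1
    return new_config, rejected
```

Storing the rejected list next to the LLM recommendation keeps a trace of what the model asked for and what the controller actually did.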

Suggested LLM Prompt Contract

The controller should send structured context, not raw logs.

Example sections:

  • current config
  • previous 3 iteration summaries
  • validation metrics
  • overfitting indicators
  • calibration notes
  • attention summary
  • example predictions
  • GPU throughput and runtime

And ask:

  • Should we keep the architecture?
  • Should we increase or decrease layers?
  • Should we widen or narrow the model?
  • Should we turn on sentiment features yet?
  • Are we overfitting?

Stop Criteria

Define "optimal" as stabilized rather than globally optimal.

Recommended stopping rules:

  • maximum 8 to 15 full iterations
  • stop after 2 or 3 consecutive non-improving rounds
  • stop when improvement is smaller than a threshold such as:
    • delta_val_loss < 0.25%
    • delta_range_coverage < 0.2%
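
These rules can be sketched as a single function over the validation-loss history. Two points are assumptions: the 0.25% threshold is read as a relative improvement over the best loss so far, and only validation loss is checked (range coverage could be handled the same way). Losses are assumed to be positive.

```python
from typing import List


def should_stop(
    val_losses: List[float],
    patience: int = 3,
    min_rel_improvement: float = 0.0025,
    max_iterations: int = 15,
) -> bool:
    """Stop on the iteration cap or after `patience` consecutive stale rounds."""
    if len(val_losses) >= max_iterations:
        return True
    best = None
    stale_rounds = 0
    for loss in val_losses:
        if best is None:
            best = loss
            continue
        improvement = (best - loss) / best if best > 0 else 0.0
        if improvement >= min_rel_improvement:
            stale_rounds = 0
        else:
            stale_rounds += 1
        best = min(best, loss)
    return stale_rounds >= patience
```

With the defaults, a run whose last three rounds each improve the best loss by less than 0.25% stops, even if the loss is still drifting down slightly.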

10. Human-Readable Internal State Logging

Important note:

These are not the model's literal "thoughts."
They are interpretable diagnostic views of internal behavior.

What to Log Each Iteration

  • configuration used
  • train / validation / holdout metrics
  • learning curves
  • attention maps for selected samples
  • embedding projection snapshots
  • sample predictions vs actual ranges
  • worst-case examples
  • best-case examples
  • calibration / uncertainty summaries
  • GPU memory, throughput, and epoch time

Human-Readable Artifacts

Recommended artifacts per iteration:

  • config.json
  • metrics.json
  • attention_summary.md
  • sample_predictions.csv
  • layer_probe_summary.md
  • training_curve.png
  • attention_heatmap_<sample>.png
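
A sketch of writing the simplest of these artifacts for one iteration. The `iteration_NNN` directory naming and the `write_iteration_artifacts` helper are assumptions. The Markdown and image artifacts are omitted because they depend on the model and plotting code.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def write_iteration_artifacts(
    root: str,
    iteration: int,
    config: Dict[str, object],
    metrics: Dict[str, float],
    predictions: List[Dict[str, object]],
) -> Path:
    """Write config.json, metrics.json, and sample_predictions.csv for one iteration."""
    out_dir = Path(root) / f"iteration_{iteration:03d}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep diffs between iterations easy to read.
    (out_dir / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
    with open(out_dir / "sample_predictions.csv", "w", newline="") as handle:
        if predictions:
            writer = csv.DictWriter(handle, fieldnames=list(predictions[0].keys()))
            writer.writeheader()
            writer.writerows(predictions)
    return out_dir
```

One directory per iteration makes it possible to compare two configurations with an ordinary file diff.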

Layer Probe Guidance

Keep this simple at first:

  • capture the first embedding projection
  • capture attention focus from the final encoder layer
  • capture 3 to 5 example hidden-state summaries
  • capture 5 sample predictions before and after training

Avoid building a giant interpretability platform in V1.

11. Metrics

Primary Metrics

  • MAE on future close return
  • MAE on future high return
  • MAE on future low return
  • RMSE on future close return
  • range coverage accuracy
  • directional accuracy for close return
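
Minimal reference implementations of three of these metrics. The design does not define range coverage, so the definition here is an assumption: a sample counts as covered when the realized low and high both fall inside the predicted low-to-high band. A zero return is treated as "not up" for directional accuracy.

```python
from typing import Sequence


def mae(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and realized returns."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def directional_accuracy(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Share of samples where the predicted and realized direction agree."""
    hits = sum(1 for p, a in zip(pred, actual) if (p > 0) == (a > 0))
    return hits / len(pred)


def range_coverage(
    pred_low: Sequence[float],
    pred_high: Sequence[float],
    actual_low: Sequence[float],
    actual_high: Sequence[float],
) -> float:
    """Share of samples whose realized range lies inside the predicted range."""
    covered = sum(
        1
        for pl, ph, al, ah in zip(pred_low, pred_high, actual_low, actual_high)
        if pl <= al and ah <= ph
    )
    return covered / len(pred_low)
```

Coverage alone rewards very wide predicted ranges, which is one reason to report it together with the high-low spread error listed below.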

Useful Secondary Metrics

  • calibration of predicted ranges
  • high-low spread error
  • correlation between predicted and realized returns
  • horizon-wise error breakdown
  • per-ticker error distribution

Selection Metric

Use a weighted composite selection score, for example:

selection_score =
  0.35 * normalized_close_mae +
  0.25 * normalized_high_mae +
  0.25 * normalized_low_mae +
  0.10 * normalized_range_coverage_error +
  0.05 * runtime_penalty

This prevents the controller from choosing a giant slow model for a tiny accuracy gain.
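
The composite can be written as a small function. The weights come from the formula above. Treating a lower score as better and assuming every component is already normalized to a comparable scale are assumptions.

```python
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "normalized_close_mae": 0.35,
    "normalized_high_mae": 0.25,
    "normalized_low_mae": 0.25,
    "normalized_range_coverage_error": 0.10,
    "runtime_penalty": 0.05,
}


def selection_score(components: Dict[str, float]) -> float:
    """Weighted sum of normalized error terms; lower is better."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Illustrative values only: a large model with slightly lower errors but a
# much higher runtime penalty ends up with the worse (higher) score.
small_model = selection_score({
    "normalized_close_mae": 0.50,
    "normalized_high_mae": 0.50,
    "normalized_low_mae": 0.50,
    "normalized_range_coverage_error": 0.50,
    "runtime_penalty": 0.10,
})
large_model = selection_score({
    "normalized_close_mae": 0.49,
    "normalized_high_mae": 0.49,
    "normalized_low_mae": 0.49,
    "normalized_range_coverage_error": 0.50,
    "runtime_penalty": 1.00,
})
```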

12. Proposed Project Structure

Recommended folder shape:

autotransformer/
  README.md
  KIRO_DESIGN.md
  Dockerfile
  requirements.txt
  config.py
  dataset.py
  features.py
  model.py
  losses.py
  trainer.py
  evaluate.py
  controller.py
  prompts/
    iteration_review.txt
  reports/
  artifacts/
  sql/
    schema.sql

For this first step, only the design package is required. The code scaffold can come next.

13. Runtime and CUDA Design

Container Base

Use:

nvcr.io/nvidia/pytorch:25.01-py3

Reason:

  • matches the existing DGX GPU workflow in this repository
  • keeps CUDA and PyTorch aligned
  • avoids host Python pollution

GPU Strategy

Version 1 should use:

  • single-GPU training on DGX / GB10
  • mixed precision if stable
  • gradient accumulation only if needed

This project does not need multi-GPU complexity on day one.

14. Logging and Persistence

Database Tables to Consider

Recommended tables:

  • autotransformer_run_log
  • autotransformer_iteration_log
  • autotransformer_model_registry
  • autotransformer_prediction_snapshots
  • autotransformer_attention_artifacts

Minimum Per-Iteration DB Fields

  • run_id
  • iteration_id
  • created_at
  • model_depth
  • model_dim
  • num_heads
  • dropout
  • learning_rate
  • batch_size
  • lookback_days
  • sentiment_enabled
  • gdelt_enabled
  • train_loss
  • val_loss
  • close_mae
  • high_mae
  • low_mae
  • range_coverage
  • directional_accuracy
  • training_minutes
  • gpu_memory_gb
  • llm_recommendation
  • controller_action
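
A sketch of the iteration log table using these fields. It uses Python's built-in `sqlite3` module as a stand-in so the example runs anywhere; the column types, the composite primary key, and the `log_iteration` helper are assumptions, and a production schema would need adapting to the target database.

```python
import sqlite3
from typing import Dict

COLUMNS = (
    "run_id", "iteration_id", "model_depth", "model_dim", "num_heads",
    "dropout", "learning_rate", "batch_size", "lookback_days",
    "sentiment_enabled", "gdelt_enabled", "train_loss", "val_loss",
    "close_mae", "high_mae", "low_mae", "range_coverage",
    "directional_accuracy", "training_minutes", "gpu_memory_gb",
    "llm_recommendation", "controller_action",
)

SCHEMA = """
CREATE TABLE IF NOT EXISTS autotransformer_iteration_log (
    run_id TEXT NOT NULL,
    iteration_id INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    model_depth INTEGER, model_dim INTEGER, num_heads INTEGER,
    dropout REAL, learning_rate REAL, batch_size INTEGER,
    lookback_days INTEGER, sentiment_enabled INTEGER, gdelt_enabled INTEGER,
    train_loss REAL, val_loss REAL,
    close_mae REAL, high_mae REAL, low_mae REAL,
    range_coverage REAL, directional_accuracy REAL,
    training_minutes REAL, gpu_memory_gb REAL,
    llm_recommendation TEXT, controller_action TEXT,
    PRIMARY KEY (run_id, iteration_id)
)
"""


def log_iteration(conn: sqlite3.Connection, row: Dict[str, object]) -> None:
    """Insert one iteration row; unknown column names are rejected up front."""
    unknown = set(row) - set(COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    names = [name for name in COLUMNS if name in row]
    sql = "INSERT INTO autotransformer_iteration_log ({}) VALUES ({})".format(
        ", ".join(names), ", ".join(":" + name for name in names)
    )
    conn.execute(sql, row)  # values are bound as named parameters
    conn.commit()
```

Column names are checked against a fixed tuple before they are placed in the SQL string, so only values, never identifiers, come from runtime data.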

15. Risks and Caveats

Real Risk 1: "Optimal" Is Not Absolute

This project can converge to a strong local configuration. It cannot guarantee a universal optimum.

Real Risk 2: LLM Recommendations Can Be Noisy

The LLM should advise only within a bounded search space. Its recommendations should not be treated as the output of a mathematically reliable optimizer.

Real Risk 3: News Features Can Hurt Before They Help

Sentiment and GDELT inputs should be added after the OHLCV baseline works. Otherwise the search space becomes too large too early.

Real Risk 4: Interpretability Can Balloon Scope

Readable layer probes are useful. A full mechanistic interpretability system is not needed for V1.

16. Implementation Phases

Phase A — Foundation

  • build dataset pipeline
  • define targets
  • build baseline transformer
  • train a single fixed 6-layer model

Phase B — Benchmarking

  • compare 4 / 6 / 8 layers
  • compare 128 / 256 / 384 dimensions
  • log throughput and metrics

Phase C — Autonomous Controller

  • add structured iteration summaries
  • add LLM review prompt
  • constrain controller actions
  • stop on stabilization criteria

Phase D — News Context

  • add newsflow_features
  • add GDELT aggregates
  • re-run bounded search

Phase E — Dashboard and Reporting

  • model catalog page
  • experiment history
  • architecture evolution view
  • attention and sample prediction gallery

17. Approximate Delivery Timeline

These are realistic working estimates, not guarantees.

Design and Scaffold

  • 0.5 to 1.5 days

First Trainable Baseline

  • 2 to 4 days

First Full Controlled Experiment Loop

  • 3 to 7 days

Adding Newsflow + GDELT Features

  • 2 to 5 days

Usable Dashboard / Reporting Layer

  • 1 to 3 days

Total to First Stabilized Version

  • roughly 1 to 3 weeks of focused work

18. Approximate Training Runtime on DGX / GB10

Actual runtime depends on:

  • lookback length
  • number of features
  • batch size
  • number of horizons
  • whether sentiment features are enabled
  • how many experiments are allowed

Reasonable estimate for V1:

  • one baseline training cycle: 1 to 4 hours
  • one larger comparison round: 4 to 12 hours
  • one bounded search run of 8 to 12 iterations: 1 to 4 days wall clock

If early stopping and pruning are implemented well, the lower end is realistic.

19. Definition of "Good Enough"

For this project, good enough means:

  • reproducible train / validation / holdout split
  • readable experiment history
  • stable bounded controller loop
  • metrics have plateaued, with no major improvement over several consecutive rounds
  • final chosen architecture recorded with evidence
  • sample predictions and attention summaries visible

That is the practical definition of "optimal state" for V1.

20. Recommended First Implementation Slice

If we keep this small and smart, the first implementation should be:

  1. Global OHLCV-only transformer
  2. 6 layers
  3. 252-day lookback
  4. horizons 1, 5, 20
  5. single-GPU CUDA training
  6. structured iteration logs to Postgres + files
  7. one bounded controller that can only change:
    • layers
    • model dimension
    • dropout
    • learning rate

Do not start with:

  • full sentiment integration
  • GDELT event taxonomy
  • multi-GPU training
  • open-ended agent refactors

21. Final Recommendation

Yes, this project is feasible on your NVIDIA setup.

The best way to do it is:

  • start with a narrow, readable baseline
  • use a bounded controller rather than unrestricted autonomy
  • treat LLM feedback as experiment guidance, not as truth
  • define stabilization instead of pretending to reach a universal optimum
  • add sentiment and GDELT only after the OHLCV baseline is stable

The estimated path to a strong first stabilized version is:

  • roughly 1 to 3 weeks

The estimated path to a richer mature version with news context, reporting, and repeated controller runs is:

  • roughly 3 to 6 weeks

22. Acceptance Criteria for Version 1

  • autotransformer/ design package exists
  • baseline transformer can train on CUDA
  • iteration logs are written every cycle
  • LLM/controller recommendations are stored
  • architecture changes are bounded and traceable
  • final selected architecture is reported
  • final evaluation metrics are exported
  • internal states are summarized in a human-readable way