AutoTransformer — Kiro-Style Design Document
autotransformer
Target Runtime: NVIDIA CUDA via Docker on DGX / GB10
Status: Design approved for first implementation slice
1. Context
We want a local-first autonomous transformer project that predicts future stock ranges using:
- roughly 20 years of daily market data
- around 100 stocks
- OHLCV as the core signal
- Finnhub, Interactive Brokers, and GDELT as optional context features
The model should:
- start with a 6-layer transformer and add or remove layers as needed
- train on 80% of the data and use the remaining 20% for validation and backtesting, as applicable
- train on NVIDIA GPU
- log internal states in a human-readable way
- evaluate itself after each training cycle
- ask an LLM or coding agent to review metrics and suggest the next adjustment
- continue iterating until performance stabilizes
- automatically adjust hyperparameters based on performance metrics
- log the time taken per layer and per epoch for dashboarding
- use any technical-indicator data already present, without overfitting or introducing bias
This project should remain understandable to a typical programmer. Readability is a requirement, not a nice-to-have.
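One way to keep the 80% training share reproducible is to split by date rather than at random, so that validation and backtest rows always come after the training rows. The sketch below assumes a chronological split; the `chronological_split` helper and its `train_fraction` parameter are hypothetical names, not part of this design.

```python
from datetime import date, timedelta


def chronological_split(dates, train_fraction=0.8):
    """Split dates into (train, later) by time order, without shuffling.

    Assumption: the split is chronological. The design only fixes the 80% share.
    """
    ordered = sorted(set(dates))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten consecutive calendar days stand in for trading days.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(10)]
train_days, later_days = chronological_split(days)
print(len(train_days), len(later_days))
```

Splitting on dates instead of rows keeps every ticker on the same cut-off, which matters for a global multi-ticker model.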
2. Problem Statement
Predict future high, low, and close ranges for multiple forward horizons from daily historical data.
The immediate design goal is not "perfect autonomy." The immediate design goal is:
- a stable training loop
- clear experiment logs
- architecture search with guardrails
- visible internal behavior
- reproducible final model selection
3. Objectives
Primary Objectives
- Build a transformer-based forecasting system for future price ranges.
- Support multi-horizon outputs such as:
- next day
- next 5 trading days
- next 20 trading days
- Run the full training and evaluation cycle on CUDA-enabled NVIDIA hardware.
- Log model behavior, training metrics, attention summaries, and sample predictions per iteration.
- Add a bounded architecture controller that can adjust depth and selected hyperparameters after each cycle.
Secondary Objectives
- Incorporate sentiment and macro/news context from:
- Finnhub
- Interactive Brokers
- GDELT
- Expose a human-readable experiment history that explains:
- what changed
- why it changed
- whether it helped
Non-Goals for Version 1
- Full AutoML across every possible architecture
- Tick-level or intraday modeling
- Reinforcement learning
- Unbounded autonomous self-modification
- Production live-trading execution
4. Key Design Decision
Use a bounded autonomous search loop, not an unbounded self-tuning system.
That means:
- start from a known baseline
- allow only a controlled set of changes per iteration
- require metric-based justification for each change
- stop after a small number of non-improving rounds
This is safer, easier to debug, and much more readable than a free-form agent loop.
5. Proposed Data Inputs
Core Inputs
Per ticker, per day:
- open
- high
- low
- close
- volume
Derived Market Features
Recommended first-pass engineered features:
- daily return
- log return
- intraday range
- close-to-high distance
- close-to-low distance
- rolling volatility
- rolling volume z-score
- moving average ratios
- ATR-style range features
- gap features
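As an illustration of the simpler features in this list, the sketch below computes a few of them for a single daily bar. The exact formulas are assumptions (each value is scaled by price so tickers stay comparable), and `basic_features` is a hypothetical helper name.

```python
import math


def basic_features(prev_close, open_, high, low, close):
    """Return a few of the listed per-day features for one bar.

    Assumption: every distance is divided by the close, so values are unitless.
    """
    return {
        "daily_return": close / prev_close - 1.0,
        "log_return": math.log(close / prev_close),
        "intraday_range": (high - low) / close,
        "close_to_high": (high - close) / close,
        "close_to_low": (close - low) / close,
        "gap": open_ / prev_close - 1.0,
    }


# Made-up prices, used only to show the call shape.
features = basic_features(prev_close=100.0, open_=101.0, high=104.0, low=99.0, close=102.0)
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

The rolling features (volatility, volume z-score, moving-average ratios) need a window of past bars and are left out of this sketch.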
Cross-Sectional / Context Features
Use a global model, not 100 separate models, for the first implementation.
Add:
- ticker embedding
- sector or group embedding if available
- market-wide context features from indices or ETF proxies later if needed
News / Sentiment Features
Use news context as optional exogenous inputs, not as the first dependency.
Recommended order:
- OHLCV-only baseline
- Add newsflow_features from the Sentiment Engine
- Add GDELT macro tone features separately
Suggested sentiment fields:
- 24h score_flow
- 72h score_flow
- article count
- positive ratio
- negative ratio
- source disagreement
Suggested GDELT fields:
- rolling mean tone
- rolling tone z-score
- event volume
- macro stress proxy counts
6. Prediction Targets
Predict future range values over multiple horizons.
Recommended target formulation:
- close_return_h1
- future_high_return_h1
- future_low_return_h1
- close_return_h5
- future_high_return_h5
- future_low_return_h5
- close_return_h20
- future_high_return_h20
- future_low_return_h20
Why returns instead of raw prices:
- easier normalization
- better cross-ticker comparability
- more stable optimization
At inference time, convert predicted returns back to price ranges using the latest close.
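The round trip from prices to return targets and back can be sketched as follows. The target definitions are an assumption: the high and low targets use the extreme over the whole horizon window, measured against the close on the anchor day. `range_targets` and `to_prices` are hypothetical helper names.

```python
def range_targets(highs, lows, closes, t, horizon):
    """Return targets for day index t over the next `horizon` days.

    Assumption: all targets are simple returns relative to closes[t].
    The caller must ensure t + horizon is a valid index.
    """
    base = closes[t]
    window = slice(t + 1, t + 1 + horizon)
    return {
        "close_return": closes[t + horizon] / base - 1.0,
        "future_high_return": max(highs[window]) / base - 1.0,
        "future_low_return": min(lows[window]) / base - 1.0,
    }


def to_prices(latest_close, predicted_returns):
    """Convert predicted returns back to price levels using the latest close."""
    return {
        name.replace("_return", "_price"): latest_close * (1.0 + value)
        for name, value in predicted_returns.items()
    }


# Made-up bars: index 0 is the anchor day, indices 1..3 are the future window.
highs = [101.0, 103.0, 106.0, 104.0]
lows = [99.0, 100.0, 102.0, 101.0]
closes = [100.0, 102.0, 105.0, 103.0]
targets = range_targets(highs, lows, closes, t=0, horizon=3)
prices = to_prices(closes[0], targets)
print(targets)
print(prices)
```

Here the true targets are fed straight back into `to_prices` to show that the conversion is lossless; at inference time the model's predicted returns would take their place.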
7. Baseline Model Architecture
Initial Architecture
Start with:
- 6 transformer encoder layers
- model dimension around 256
- 8 attention heads
- feed-forward dimension around 1024
- dropout around 0.1
- sequence length around 252 trading days
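These starting values could be held in one small, immutable config object so that each iteration's settings are easy to log and compare. This is a minimal sketch: the class name `ModelConfig` and its field names are illustrative assumptions, and the divisibility check reflects the usual requirement that multi-head attention splits the model dimension evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Baseline hyperparameters from this section (field names are illustrative)."""

    num_layers: int = 6
    d_model: int = 256
    num_heads: int = 8
    ffn_dim: int = 1024
    dropout: float = 0.1
    lookback_days: int = 252

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")


baseline = ModelConfig()
print(baseline.d_model // baseline.num_heads)  # per-head dimension: 32
```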
Input Layout
Use a global multi-ticker model:
- one training sample = one ticker window
- static ticker embedding
- dynamic daily features across the lookback window
Output Head
Use multi-head regression:
- one head per horizon
- each head predicts:
- future close return
- future high return
- future low return
Loss
Recommended V1 loss:
- weighted Huber or MSE for returns
Recommended V2 improvement:
- add range-consistency penalties so:
- predicted low does not exceed predicted close
- predicted high does not fall below predicted close
- predicted high remains above predicted low
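A plain-Python sketch of both ideas is shown below: a Huber loss for one residual, and a hinge-style penalty that is zero whenever low <= close <= high holds. The `delta` value and the choice to sum the three violations with equal weight are assumptions, not decisions made in this design. A real implementation would use tensor operations in the training framework.

```python
def huber(error, delta=1.0):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def range_penalty(pred_low, pred_close, pred_high):
    """Sum of hinge penalties, one per violated ordering constraint."""
    return (
        max(0.0, pred_low - pred_close)     # low must not exceed close
        + max(0.0, pred_close - pred_high)  # high must not fall below close
        + max(0.0, pred_low - pred_high)    # high must stay above low
    )


print(huber(0.5), huber(2.0))            # 0.125 1.5
print(range_penalty(-0.01, 0.0, 0.02))   # consistent range: 0.0
print(range_penalty(0.01, 0.0, 0.02))    # low above close: positive penalty
```

Because daily returns are small, `delta` would need to be tuned to the return scale; with `delta=1.0` the Huber loss behaves like half the squared error on typical residuals.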
8. Why a Transformer Here
A transformer is attractive because it can:
- look across long historical windows
- model temporal relationships through attention rather than recurrence
- produce attention summaries that are easier to inspect than many alternatives
- scale cleanly with global multi-ticker training
That said, the design should remain humble:
- a transformer is not automatically better than a Temporal Fusion Transformer (TFT) or an LSTM
- the benchmark must prove the value
9. Autonomous Iteration Loop
Principle
The system should not directly rewrite its own source code freely.
Instead, it should run a controller loop like this:
- Train current configuration
- Evaluate on validation and holdout sets
- Log metrics and interpretable artifacts
- Produce a structured summary for an LLM or coding agent
- Ask for a constrained next-step recommendation
- Apply only allowed changes
- Repeat until stop criteria are met
Allowed Hyperparameter Changes Per Iteration
Recommended initial search space:
- number of layers: 4, 6, 8, 10
- model dimension: 128, 256, 384
- attention heads: 4, 8
- dropout: 0.05, 0.10, 0.15, 0.20
- learning rate
- batch size
- lookback window
- sentiment feature on/off
- GDELT feature on/off
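The "apply only allowed changes" step could be enforced with a small filter like the one below. It is a sketch under assumptions: the key names, the `apply_recommendation` helper, and the `max_changes` limit are hypothetical, and only the four parameters with explicit value lists above are covered.

```python
# Allowed values copied from the search space listed above.
SEARCH_SPACE = {
    "num_layers": {4, 6, 8, 10},
    "d_model": {128, 256, 384},
    "num_heads": {4, 8},
    "dropout": {0.05, 0.10, 0.15, 0.20},
}


def apply_recommendation(config, recommendation, max_changes=1):
    """Return (new_config, accepted_changes), ignoring out-of-bounds requests.

    Assumption: at most `max_changes` parameters may change per iteration.
    """
    accepted = {}
    for key, value in recommendation.items():
        if len(accepted) >= max_changes:
            break
        allowed = SEARCH_SPACE.get(key, set())
        if value in allowed and config.get(key) != value:
            accepted[key] = value
    return {**config, **accepted}, accepted


config = {"num_layers": 6, "d_model": 256, "num_heads": 8, "dropout": 0.10}
# 12 layers is outside the search space, so only the dropout change survives.
new_config, changes = apply_recommendation(config, {"num_layers": 12, "dropout": 0.15})
print(changes)
```

Returning the accepted changes separately makes it straightforward to log what the controller actually did, as opposed to what the LLM asked for.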
Suggested LLM Prompt Contract
The controller should send structured context, not raw logs.
Example sections:
- current config
- previous 3 iteration summaries
- validation metrics
- overfitting indicators
- calibration notes
- attention summary
- example predictions
- GPU throughput and runtime
And ask:
- Should we keep the architecture?
- Should we increase or decrease layers?
- Should we widen or narrow the model?
- Should we turn on sentiment features yet?
- Are we overfitting?
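A structured payload for this contract might be assembled as shown below. The field names and the `build_review_payload` helper are assumptions made for illustration; only the section list and the questions come from this document, and the sample metric values are placeholders.

```python
import json

REVIEW_QUESTIONS = [
    "Should we keep the architecture?",
    "Should we increase or decrease layers?",
    "Should we widen or narrow the model?",
    "Should we turn on sentiment features yet?",
    "Are we overfitting?",
]


def build_review_payload(config, history, validation_metrics, notes):
    """Serialize the review context; only the last 3 iteration summaries are sent."""
    payload = {
        "current_config": config,
        "previous_iterations": history[-3:],
        "validation_metrics": validation_metrics,
        "notes": notes,
        "questions": REVIEW_QUESTIONS,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder history and metrics, not real results.
history = [{"iteration": i, "val_loss": 1.0 - 0.1 * i} for i in range(5)]
text = build_review_payload(
    config={"num_layers": 6, "d_model": 256},
    history=history,
    validation_metrics={"close_mae": 0.012},
    notes={"overfitting": "train/val gap stable"},
)
print(text)
```

Sending a fixed-shape JSON document instead of raw logs keeps the prompt size predictable and makes each review reproducible from stored artifacts.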
Stop Criteria
Define "optimal" as stabilized rather than globally optimal.
Recommended stopping rules:
- maximum 8 to 15 full iterations
- stop after 2 or 3 consecutive non-improving rounds
- stop when improvement is smaller than a threshold such as:
- delta_val_loss < 0.25%
- delta_range_coverage < 0.2%
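The stopping rules could be combined into one check along these lines. The sketch makes two assumptions: the 0.25% threshold is read as a relative improvement over the best validation loss so far, and a round that improves by less than that counts as non-improving. The range-coverage threshold is omitted, and `should_stop` with its parameter names is hypothetical.

```python
def should_stop(val_losses, max_iterations=12, patience=3, min_rel_improvement=0.0025):
    """Return (stop, reason) for validation losses in iteration order (lower is better)."""
    if not val_losses:
        return False, "continue"
    if len(val_losses) >= max_iterations:
        return True, "max iterations reached"
    best = val_losses[0]
    stale = 0  # consecutive rounds without a meaningful improvement
    for loss in val_losses[1:]:
        if best > 0 and (best - loss) / best >= min_rel_improvement:
            best = loss
            stale = 0
        else:
            stale += 1
    if stale >= patience:
        return True, f"{stale} consecutive non-improving rounds"
    return False, "continue"


# Placeholder loss curves, not real results.
print(should_stop([1.00, 0.90, 0.80]))
print(should_stop([1.00, 0.90, 0.899, 0.8995, 0.90]))
```

The second call stops because the last three rounds each improve on the best loss by less than 0.25%.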
10. Human-Readable Internal State Logging
Important note:
What to Log Each Iteration
- configuration used
- train / validation / holdout metrics
- learning curves
- attention maps for selected samples
- embedding projection snapshots
- sample predictions vs actual ranges
- worst-case examples
- best-case examples
- calibration / uncertainty summaries
- GPU memory, throughput, and epoch time
Human-Readable Artifacts
Recommended artifacts per iteration:
- config.json
- metrics.json
- attention_summary.md
- sample_predictions.csv
- layer_probe_summary.md
- training_curve.png
- attention_heatmap_<sample>.png
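Writing the two JSON artifacts, together with a simple epoch timer, might look like the sketch below. The `iter_<NNN>` folder naming, the `write_iteration_artifacts` helper, and the metric values are assumptions made for illustration.

```python
import json
import tempfile
import time
from pathlib import Path


def write_iteration_artifacts(root, iteration, config, metrics):
    """Write config.json and metrics.json under <root>/iter_<NNN>/ and return the folder."""
    folder = Path(root) / f"iter_{iteration:03d}"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    (folder / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return folder


# Time one (empty) epoch; a real training step would run between the two calls.
start = time.perf_counter()
epoch_seconds = time.perf_counter() - start

# A temporary directory stands in for the real artifacts folder.
with tempfile.TemporaryDirectory() as tmp:
    out = write_iteration_artifacts(
        tmp,
        iteration=1,
        config={"num_layers": 6},
        metrics={"val_loss": 0.42, "epoch_seconds": epoch_seconds},  # placeholder values
    )
    written = sorted(p.name for p in out.iterdir())
    saved_metrics = json.loads((out / "metrics.json").read_text())
print(written)
```

Sorted keys and fixed indentation keep the files diff-friendly, which helps when comparing iterations by hand.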
Layer Probe Guidance
Keep this simple at first:
- capture the first embedding projection
- capture attention focus from the final encoder layer
- capture 3 to 5 example hidden-state summaries
- capture 5 sample predictions before and after training
Avoid building a giant interpretability platform in V1.
11. Metrics
Primary Metrics
- MAE on future close return
- MAE on future high return
- MAE on future low return
- RMSE on future close return
- range coverage accuracy
- directional accuracy for close return
Useful Secondary Metrics
- calibration of predicted ranges
- high-low spread error
- correlation between predicted and realized returns
- horizon-wise error breakdown
- per-ticker error distribution
Selection Metric
Use a weighted composite selection score, for example:
selection_score =
0.35 * normalized_close_mae +
0.25 * normalized_high_mae +
0.25 * normalized_low_mae +
0.10 * normalized_range_coverage_error +
0.05 * runtime_penalty
This prevents the controller from choosing a giant slow model for a tiny accuracy gain.
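The composite score above translates directly into a weighted sum. The sketch assumes that every component is normalized so that lower is better; the sample component values are made up to show the runtime penalty at work.

```python
# Weights copied from the selection score formula above.
WEIGHTS = {
    "normalized_close_mae": 0.35,
    "normalized_high_mae": 0.25,
    "normalized_low_mae": 0.25,
    "normalized_range_coverage_error": 0.10,
    "runtime_penalty": 0.05,
}


def selection_score(components):
    """Weighted sum of error and penalty terms. Assumption: lower is better."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up values: the large model is slightly more accurate but much slower.
small_model = {name: 0.50 for name in WEIGHTS} | {"runtime_penalty": 0.10}
large_model = {name: 0.49 for name in WEIGHTS} | {"runtime_penalty": 1.00}
print(selection_score(small_model), selection_score(large_model))
```

With these numbers the small model scores lower (better), because the large model's accuracy gain is smaller than its runtime cost.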
12. Proposed Project Structure
Recommended folder shape:
autotransformer/
  README.md
  KIRO_DESIGN.md
  Dockerfile
  requirements.txt
  config.py
  dataset.py
  features.py
  model.py
  losses.py
  trainer.py
  evaluate.py
  controller.py
  prompts/
    iteration_review.txt
  reports/
  artifacts/
  sql/
    schema.sql
For this first step, only the design package is required. The code scaffold can come next.
13. Runtime and CUDA Design
Container Base
Use:
nvcr.io/nvidia/pytorch:25.01-py3
Reason:
- matches the existing DGX GPU workflow in this repository
- keeps CUDA and PyTorch aligned
- avoids host Python pollution
GPU Strategy
Version 1 should use:
- single-GPU training on DGX / GB10
- mixed precision if stable
- gradient accumulation only if needed
This project does not need multi-GPU complexity on day one.
14. Logging and Persistence
Database Tables to Consider
Recommended tables:
- autotransformer_run_log
- autotransformer_iteration_log
- autotransformer_model_registry
- autotransformer_prediction_snapshots
- autotransformer_attention_artifacts
Minimum Per-Iteration DB Fields
- run_id
- iteration_id
- created_at
- model_depth
- model_dim
- num_heads
- dropout
- learning_rate
- batch_size
- lookback_days
- sentiment_enabled
- gdelt_enabled
- train_loss
- val_loss
- close_mae
- high_mae
- low_mae
- range_coverage
- directional_accuracy
- training_minutes
- gpu_memory_gb
- llm_recommendation
- controller_action
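To show the shape of one iteration record, the sketch below stores a subset of these fields using Python's built-in `sqlite3` module as a stand-in for the real database. The column types, the composite primary key, and the sample row are assumptions; the actual schema would live in sql/schema.sql.

```python
import sqlite3

# In-memory SQLite is a stand-in here; it is not the design's target database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE autotransformer_iteration_log (
        run_id            TEXT NOT NULL,
        iteration_id      INTEGER NOT NULL,
        created_at        TEXT NOT NULL,
        model_depth       INTEGER,
        model_dim         INTEGER,
        val_loss          REAL,
        controller_action TEXT,
        PRIMARY KEY (run_id, iteration_id)
    )
    """
)

# Placeholder row, not a real result.
row = {
    "run_id": "run-001",
    "iteration_id": 1,
    "created_at": "2025-01-01T00:00:00Z",
    "model_depth": 6,
    "model_dim": 256,
    "val_loss": 0.42,
    "controller_action": "keep",
}
conn.execute(
    "INSERT INTO autotransformer_iteration_log VALUES "
    "(:run_id, :iteration_id, :created_at, :model_depth, :model_dim, :val_loss, :controller_action)",
    row,
)
stored = conn.execute(
    "SELECT model_depth, controller_action FROM autotransformer_iteration_log WHERE run_id = ?",
    ("run-001",),
).fetchone()
print(stored)  # (6, 'keep')
```

Keying on (run_id, iteration_id) prevents the same iteration from being logged twice within a run.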
15. Risks and Caveats
Real Risk 1: "Optimal" Is Not Absolute
This project can converge to a strong local configuration. It cannot guarantee a universal optimum.
Real Risk 2: LLM Recommendations Can Be Noisy
The LLM should only advise within a bounded search space, and the controller should apply its suggestions only after checking them against that space. LLM output should not be treated as a mathematically reliable optimizer.
Real Risk 3: News Features Can Hurt Before They Help
Sentiment and GDELT inputs should be added after the OHLCV baseline works. Otherwise the search space becomes too large too early.
Real Risk 4: Interpretability Can Balloon Scope
Readable layer probes are useful. A full mechanistic interpretability system is not needed for V1.
16. Recommended Build Phases
Phase A — Foundation
- build dataset pipeline
- define targets
- build baseline transformer
- train a single fixed 6-layer model
Phase B — Benchmarking
- compare 4 / 6 / 8 layers
- compare 128 / 256 / 384 dimensions
- log throughput and metrics
Phase C — Autonomous Controller
- add structured iteration summaries
- add LLM review prompt
- constrain controller actions
- stop on stabilization criteria
Phase D — News Context
- add newsflow_features
- add GDELT aggregates
- re-run bounded search
Phase E — Dashboard and Reporting
- model catalog page
- experiment history
- architecture evolution view
- attention and sample prediction gallery
17. Approximate Delivery Timeline
These are realistic working estimates, not guarantees.
Design and Scaffold
- 0.5 to 1.5 days
First Trainable Baseline
- 2 to 4 days
First Full Controlled Experiment Loop
- 3 to 7 days
Adding Newsflow + GDELT Features
- 2 to 5 days
Usable Dashboard / Reporting Layer
- 1 to 3 days
Total to First Stabilized Version
- roughly 1 to 3 weeks of focused work
18. Approximate Training Runtime on DGX / GB10
Actual runtime depends on:
- lookback length
- number of features
- batch size
- number of horizons
- whether sentiment features are enabled
- how many experiments are allowed
Reasonable estimate for V1:
- one baseline training cycle: 1 to 4 hours
- one larger comparison round: 4 to 12 hours
- one bounded search run of 8 to 12 iterations: 1 to 4 days wall clock
If early stopping and pruning are implemented well, the lower end is realistic.
19. Definition of "Good Enough"
For this project, good enough means:
- reproducible train / validation / holdout split
- readable experiment history
- stable bounded controller loop
- no major metric improvement over several rounds
- final chosen architecture recorded with evidence
- sample predictions and attention summaries visible
That is the practical definition of "optimal state" for V1.
20. Recommended First Implementation Slice
If we keep this small and smart, the first implementation should be:
- Global OHLCV-only transformer
- 6 layers
- 252-day lookback
- horizons 1, 5, 20
- single-GPU CUDA training
- structured iteration logs to Postgres + files
- one bounded controller that can only change:
- layers
- model dimension
- dropout
- learning rate
Do not start with:
- full sentiment integration
- GDELT event taxonomy
- multi-GPU training
- open-ended agent refactors
21. Final Recommendation
Yes, this project is feasible on your NVIDIA setup.
The best way to do it is:
- start with a narrow, readable baseline
- use a bounded controller rather than unrestricted autonomy
- treat LLM feedback as experiment guidance, not as truth
- define stabilization instead of pretending to reach a universal optimum
- add sentiment and GDELT only after the OHLCV baseline is stable
The estimated path to a strong first stabilized version is:
- roughly 1 to 3 weeks
The estimated path to a richer mature version with news context, reporting, and repeated controller runs is:
- roughly 3 to 6 weeks
22. Acceptance Criteria for Version 1
- autotransformer/ design package exists
- baseline transformer can train on CUDA
- iteration logs are written every cycle
- LLM/controller recommendations are stored
- architecture changes are bounded and traceable
- final selected architecture is reported
- final evaluation metrics are exported
- internal states are summarized in a human-readable way