From Daily Bars to Market Regime Intelligence on a DGX
Introduction
Most trading systems start too late in the pipeline.
They begin with a model and ask, "What should we predict?" before they can clearly answer:
- what market state are we in
- which names are moving together
- which names are behaving abnormally
- whether a move is part of a broad regime or a stock-specific event
For a Nasdaq-100 universe with 20 years of daily OHLCV data, a better first project is not "predict tomorrow's close." A better first project is to build market regime intelligence.
That means building a system that can:
- detect bullish, bearish, transitional, and high-volatility states
- group stocks by shared behavior
- surface names that detach from their peers
- estimate ranges of outcomes instead of overconfident point predictions
- send the most interesting anomalies to a human or news layer for review
This document lays out that design for dgx-trading-system.
The aim is twofold:
- improve market understanding
- improve model understanding
Those are different goals, and the system should support both.
The Core Idea
The market is not one environment. It shifts.
Some periods are:
- broad bullish trends with tight leadership
- broad bullish trends with low volatility
- narrow rallies led by a few megacaps
- panic regimes with rising correlation
- mean-reverting churn
- post-shock recovery transitions
Trying to fit one model across all of these without first identifying the state usually produces fragile conclusions.
So the project should work in layers:
- detect regimes
- find correlated groups inside each regime
- detect outliers relative to the group
- estimate the range of forward outcomes given the setup
- cross-check outliers against market news and events
This layered design is more useful than a single "BUY/SELL" model because each layer can be inspected and validated on its own.
Why Regimes Matter
A stock can behave very differently depending on the environment.
For example:
- In a broad risk-on regime, semiconductors may move as a cluster and respond to momentum.
- In a panic regime, correlation may spike and stock-specific factors matter less.
- In a post-earnings dispersion regime, stock-specific news can dominate sector behavior.
If you do not condition on regime, you mix together incompatible behaviors and pollute both your features and your labels.
Regime labeling helps answer:
- Is the current move broad or narrow?
- Is correlation rising or breaking apart?
- Are leaders stable or rotating?
- Should a stock's move be judged relative to the market, sector, or its own cluster?
That is why regime detection belongs near the front of the pipeline.
Why Correlated Groups Matter
Raw correlation is informative but incomplete.
If two names are highly correlated, that alone does not tell you:
- whether they are part of a stable cluster
- whether their relationship is regime-specific
- whether one of them is currently behaving unusually
Clustering and graph-style thinking help because the market is not just a list of tickers. It is a network.
In a good market intelligence system, you do not just ask:
"Is NVDA up?"
You ask:
- Is NVDA strong relative to semis?
- Is the semiconductor cluster itself strong?
- Is the cluster central to the whole market right now?
- Is NVDA acting like its cluster, or detaching from it?
That shift in framing is where a lot of useful intelligence comes from.
Why Outliers Matter
Average behavior is often less interesting than abnormal behavior.
Suppose most of a cluster is flat, but one stock is surging on volume, breaking its normal correlation profile, and moving to the edge of the feature distribution. That is often where the most actionable signal is.
Outliers can mean:
- company-specific news
- earnings surprises
- product launches
- analyst revisions
- policy or regulatory events
- unusual positioning or forced flows
- early leadership rotation
The correct response to an outlier is not immediate trust. The correct response is investigation.
That is why the system should send strong outliers into a review queue for news cross-checking.
Why Range-of-Outcomes Beats Point Prediction
Markets are noisy, and next-day direction is a weak target on its own.
A point forecast like "tomorrow return = 0.62%" sounds precise, but it usually hides the uncertainty that matters most.
For practical research, it is usually more useful to know:
- downside tail
- upside tail
- median expected path
- uncertainty width under current regime
- whether historical outcomes were stable or chaotic under similar conditions
That is why the system should output scenario ranges such as:
- 10th percentile return
- 25th percentile return
- median return
- 75th percentile return
- 90th percentile return
This is both more honest and more useful.
Project Architecture
The project should remain modular and Dockerized.
Existing Foundation
dgx-trading-system already has:
- PostgreSQL
- daily ingestion for a large Nasdaq-100 universe
- Dockerized workloads
- NVIDIA-capable model containers
That is enough to build the next layer without changing ingestion.
Recommended New Modules
analytics/regime/
    features.py
    market_features.py
    regime_hmm.py
    change_points.py
    pca_factors.py
    correlations.py
    clusters.py
    outliers.py
    run_features.py
    run_regimes.py
    run_clusters.py
    run_outliers.py
    summarize.py
model/outcomes/
    dataset.py
    targets.py
    baseline_models.py
    quantile_model.py
    montecarlo.py
    train.py
    train_universe.py
    predict.py
    predict_universe.py
This preserves the existing separation:
- ingestion stays ingestion
- analytics stays analytics
- models stay models
- reporting stays reporting
That is a healthy design for both production and learning.
The Data Model
The system should not dump everything into one table.
Instead, it should separate:
- raw prices
- derived features
- regime labels
- relationship graphs / clusters
- outlier events
- forecast ranges
- review metadata
That gives you traceability.
Key Tables
regime_feature_daily
Stores ticker-level engineered features. This is the substrate used by regime, cluster, outlier, and scenario models.
market_regime_state
Stores market-level hidden states by date and horizon. This lets you ask:
- what regime were we in on this date?
- did it change recently?
- how confident was the regime model?
correlation_cluster_daily
Stores cluster membership by horizon and date. This turns raw correlation into a usable structure.
outlier_event_daily
Stores names behaving abnormally relative to their current cluster or own history.
scenario_forecast_daily
Stores range-based forward views instead of only a directional guess.
news_review_queue
Stores items for manual or later automated event review.
This table is where quant research hands off to qualitative research.
Feature Engineering
Feature engineering should start with transparent, interpretable signals.
That matters because if later models behave strangely, you need to know whether the problem came from:
- the raw data
- the feature logic
- the regime model
- the clustering step
- the forecasting model
Interpretable features make that possible.
Recommended Row-Level Features
- 1-day log return
- 5-day log return
- 20-day log return
- realized volatility over 20 and 60 days
- ATR(14)
- RSI(14)
- MACD histogram
- Bollinger %B
- volume ratio versus 20-day average
- high-low range normalized by close
- close-to-open return
- 5-day rate of change
- distance from SMA20 and SMA60
- beta to QQQ over 60 days
- correlation to QQQ over 60 days
- relative strength versus QQQ
- rolling max drawdown
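A few of the features above can be sketched in pandas as follows. This is a minimal illustration, not the project's features.py: the function name row_features, the column names close and volume, and the 252-day annualization factor are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def row_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute a handful of per-ticker features.

    df: one ticker's daily bars, sorted by date, with 'close' and
    'volume' columns (hypothetical column names).
    """
    out = pd.DataFrame(index=df.index)
    log_close = np.log(df["close"])

    # Log returns over 1, 5, and 20 trading days.
    out["ret_1d"] = log_close.diff(1)
    out["ret_5d"] = log_close.diff(5)
    out["ret_20d"] = log_close.diff(20)

    # Realized volatility: std of daily log returns, annualized
    # with an assumed 252 trading days per year.
    out["rvol_20d"] = out["ret_1d"].rolling(20).std() * np.sqrt(252)

    # Volume ratio versus the 20-day average.
    out["vol_ratio_20d"] = df["volume"] / df["volume"].rolling(20).mean()

    # Distance from SMA20 as a fraction of the average.
    sma20 = df["close"].rolling(20).mean()
    out["dist_sma20"] = df["close"] / sma20 - 1.0
    return out
```

Every rolling feature is NaN until its window is full. Keeping those NaNs rather than filling them makes it easier to validate the feature store by hand.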
Recommended Market-Level Features
Aggregate universe signals such as:
- fraction of tickers above SMA20 / SMA60 / SMA200
- average cross-sectional return
- average realized volatility
- median correlation
- dispersion of returns
- top-decile versus bottom-decile spread
- share of positive 20-day momentum names
These features are more useful for regime detection than looking at QQQ alone.
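As a rough sketch, several of these breadth and dispersion signals can be computed from a wide table of closing prices. The function name market_features and the input layout (one row per date, one column per ticker) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def market_features(close: pd.DataFrame) -> pd.DataFrame:
    """close: dates x tickers table of closing prices (assumed layout)."""
    log_close = np.log(close)
    ret = log_close.diff()
    out = pd.DataFrame(index=close.index)

    # Breadth: fraction of tickers trading above their own SMA20.
    # Cells without a full window are masked so they do not count as "below".
    sma20 = close.rolling(20).mean()
    above = (close > sma20).astype(float).where(sma20.notna())
    out["frac_above_sma20"] = above.mean(axis=1)

    # Average cross-sectional return and its dispersion on each date.
    out["mean_ret"] = ret.mean(axis=1)
    out["dispersion"] = ret.std(axis=1)

    # Share of names with positive 20-day momentum.
    mom20 = log_close.diff(20)
    positive = (mom20 > 0).astype(float).where(mom20.notna())
    out["frac_pos_mom20"] = positive.mean(axis=1)
    return out
```

The masking step matters: comparing against a NaN moving average yields False, which would otherwise understate breadth for tickers with short history.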
Regime Detection Theory
There is no single perfect regime algorithm, so the system should use a small stack of complementary methods.
Hidden Markov Models
HMMs are a strong first choice because they model:
- latent states
- transition persistence
- observation noise
For example, the system might learn states like:
- low-volatility bullish trend
- high-volatility bullish recovery
- defensive or bearish regime
- transition regime
The exact names come after fitting. The data determines the hidden states; the human interprets them.
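To make the role of transition persistence concrete, here is a small sketch of Viterbi decoding for a one-dimensional Gaussian HMM. It covers decoding only: the state means, standard deviations, and transition matrix below are hand-set, hypothetical values, whereas a real regime model would estimate them from data (for example with Baum-Welch, as implemented in libraries such as hmmlearn). The function name viterbi_gaussian is also an assumption.

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, trans, start):
    """Most likely hidden-state path for a 1-D Gaussian HMM with known parameters."""
    obs = np.asarray(obs, dtype=float)
    means, stds = np.asarray(means, dtype=float), np.asarray(stds, dtype=float)
    n, k = len(obs), len(means)
    # Log density of every observation under every state: shape (n, k).
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds) - 0.5 * np.log(2.0 * np.pi))
    log_trans = np.log(np.asarray(trans, dtype=float))
    score = np.log(np.asarray(start, dtype=float)) + log_emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans   # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = np.empty(n, dtype=int)
    path[-1] = score.argmax()
    for t in range(n - 1, 0, -1):           # walk the back-pointers
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical two-state setup: state 0 = calm, state 1 = turbulent.
means = [0.001, -0.002]
stds = [0.008, 0.025]
trans = [[0.98, 0.02],    # calm tends to stay calm
         [0.05, 0.95]]    # turbulent tends to stay turbulent
start = [0.5, 0.5]

calm = [0.005, -0.005] * 10        # 20 small daily returns
turbulent = [0.04, -0.04] * 10     # 20 large daily returns
path = viterbi_gaussian(calm + turbulent, means, stds, trans, start)
```

Because leaving a state is expensive under this transition matrix, one unusual day inside a calm stretch is not enough to flip the decoded state. That persistence is the property that makes HMM states usable as regime labels.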
Change-Point Detection
HMMs are good at persistent states, but they do not always mark boundaries cleanly.
Change-point detection complements them by identifying moments where structure shifts:
- a volatility break
- a trend break
- a sudden change in cross-sectional dispersion
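The simplest version of this idea is a single mean-shift detector: try every split point and keep the one that most reduces squared error. The sketch below assumes a one-dimensional series such as rolling volatility. The name best_split and the min_size guard are illustrative choices, and production change-point methods (multiple breaks, penalties, online detection) are more involved.

```python
import numpy as np

def best_split(x, min_size=5):
    """Find the index that best splits x into two segments with different means.

    Returns (t, gain): x[:t] and x[t:] are the two segments, and gain is the
    reduction in total squared error compared with fitting a single mean.
    Returns (None, 0.0) if no split improves on a single segment.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    total = ((x - x.mean()) ** 2).sum()
    best_t, best_gain = None, 0.0
    for t in range(min_size, n - min_size + 1):
        left, right = x[:t], x[t:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        gain = total - cost
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical volatility series that steps from 1% to 3% halfway through.
vol = [0.01] * 30 + [0.03] * 30
t, gain = best_split(vol)   # t is the first index of the new segment
```

The returned index is a candidate boundary, not a verdict. In practice the gain should be compared against what a series with no break would produce before a break is declared.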
This is helpful when you want to answer:
"When did the new regime begin?"
Rolling PCA
Rolling PCA is not a regime model by itself, but it is very useful for diagnosing structure.
It tells you:
- whether one factor dominates the market
- whether leadership is broad or narrow
- whether the factor structure is stable or changing
That makes it excellent for interpreting "bullish" conditions:
- broad bullish leadership
- concentrated bullish leadership
- unstable rally with rising dispersion
Those are different environments, and rolling PCA helps reveal that.
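One simple diagnostic along these lines is the share of variance explained by the first principal component in each rolling window. The sketch below works on the correlation matrix of returns; the function name first_pc_share and the 60-day default window are assumptions for illustration.

```python
import numpy as np

def first_pc_share(returns, window=60):
    """Rolling share of variance explained by the first principal component.

    returns: 2-D array, days x tickers. Entries before the first full
    window are NaN.
    """
    returns = np.asarray(returns, dtype=float)
    n_days, n_names = returns.shape
    shares = np.full(n_days, np.nan)
    for end in range(window, n_days + 1):
        corr = np.corrcoef(returns[end - window:end], rowvar=False)
        eigvals = np.linalg.eigvalsh(corr)          # ascending order
        # The trace of a correlation matrix equals the number of names.
        shares[end - 1] = eigvals[-1] / n_names
    return shares
```

A share close to 1 means one factor is driving almost everything. A share close to 1 / n_names means the names are moving largely independently. Tracking how the share changes over time is what separates a broad rally from a concentrated one.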
Clustering and Market Topology
Correlation matrices are useful, but cluster structure is often more actionable than the raw matrix.
Recommended Starting Method
Hierarchical clustering on rolling correlation distance:
distance(a, b) = sqrt(0.5 * (1 - corr(a, b)))
Why start here:
- interpretable
- stable enough for daily research
- good for visual diagnostics
- easy to compare across horizons
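The distance above maps a correlation of 1 to a distance of 0 and a correlation of -1 to a distance of 1. A minimal sketch of the clustering step using SciPy follows. The function name correlation_clusters, the average-linkage choice, and the fixed cluster count are assumptions, not settings taken from the project.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_clusters(returns, n_clusters):
    """Cluster tickers by correlation distance.

    returns: 2-D array, days x tickers, for one rolling window.
    Returns one integer label per ticker, numbered from 1.
    """
    corr = np.corrcoef(np.asarray(returns, dtype=float), rowvar=False)
    # distance(a, b) = sqrt(0.5 * (1 - corr(a, b))); clip guards against
    # tiny negative values caused by floating-point error.
    dist = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)    # SciPy expects condensed form
    tree = linkage(condensed, method="average")   # assumed linkage method
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Cluster label numbers are arbitrary and can change from one window to the next, so comparing clusters across dates means comparing membership rather than label values.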
What To Look For
- which clusters persist across windows
- which clusters change sharply across regime transitions
- whether leadership is dominated by one cluster
- whether a stock jumps between clusters frequently
Cluster instability can itself be useful information.
Outlier Detection Theory
There is no single kind of outlier.
There are different kinds of unusual behavior, and each suggests a different story.
1. Return Outliers
These are names whose returns are abnormally strong or weak relative to their cluster peers.
Use:
- robust z-score
- median absolute deviation
This is simple and useful.
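A robust z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD), so one extreme name does not distort the scale it is measured against. The sketch below uses the common 0.6745 scaling constant, which makes the score comparable to an ordinary z-score for normally distributed data. The function name robust_z and the zero-MAD fallback are assumptions.

```python
import numpy as np

def robust_z(values):
    """Modified z-score of each value relative to its group (for example, a cluster)."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        # Assumed fallback: at least half the values are identical,
        # so no meaningful scale exists. Report no outliers.
        return np.zeros_like(values)
    return 0.6745 * (values - med) / mad

# Hypothetical one-day returns for a six-name cluster; the last name detaches.
cluster_returns = [0.010, 0.012, 0.009, 0.011, 0.010, 0.080]
scores = robust_z(cluster_returns)
```

A cutoff around 3.5 in absolute value is a common rule of thumb for flagging, but the right threshold for the review queue is something to calibrate against history.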
2. Correlation Outliers
These are names whose co-movement structure breaks down.
Examples:
- a stock normally moves with semiconductors but suddenly disconnects
- a cluster leader stops behaving like the cluster
- a defensive name starts trading like a momentum name
This kind of outlier is often particularly valuable.
3. Feature-Space Outliers
These are names whose entire behavior profile becomes unusual.
Use:
- Mahalanobis distance
- later, maybe Isolation Forest
This captures multi-dimensional abnormality better than one indicator can.
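Mahalanobis distance measures how far a feature vector sits from the center of its peer group after accounting for the scale and correlation of the features. A minimal sketch, assuming a rows-by-features matrix for one date or one cluster; the function name and the use of a pseudo-inverse (to tolerate a singular covariance matrix) are illustrative choices.

```python
import numpy as np

def mahalanobis_distances(features):
    """Distance of each row from the sample mean, scaled by the sample covariance.

    features: 2-D array, one row per ticker, one column per feature.
    """
    X = np.asarray(features, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv = np.linalg.pinv(cov)            # pseudo-inverse handles collinear features
    diff = X - mu
    # Quadratic form diff_i^T * inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))
```

The mean and covariance here are ordinary sample estimates, so a strong outlier also pulls them toward itself. Robust covariance estimators reduce that effect and are a natural refinement once the basic pipeline works.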
Why Outliers Need News Review
An outlier may be:
- a real event
- a data problem
- an artifact of an unstable cluster
- a temporary flow-driven move
That is why the correct pattern is:
- quantify the anomaly
- rank the severity
- send it for review
- tag the driver if confirmed
That process turns raw anomaly detection into intelligence.
Outcome Modeling Theory
Once the system knows:
- current regime
- current cluster
- outlier status
- current feature state
it can estimate what typically happened next under similar conditions.
Best First Step: Conditional Historical Distributions
This is often the highest-value first model.
Condition on:
- regime
- cluster
- volatility bucket
- relative strength bucket
Then compute forward returns at:
- 5 days
- 20 days
- 60 days
This alone can produce useful scenario tables.
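A scenario table of this kind is a grouped quantile calculation. The sketch below assumes a long table with one row per ticker and date that already carries the conditioning columns and a forward-return column. All names (forward_log_return, scenario_table, regime, vol_bucket, min_obs) are hypothetical.

```python
import numpy as np
import pandas as pd

def forward_log_return(close: pd.Series, horizon: int) -> pd.Series:
    """Log return from today to `horizon` trading days ahead (a label, never a feature)."""
    log_close = np.log(close)
    return log_close.shift(-horizon) - log_close

def scenario_table(df, horizon_col, by=("regime", "vol_bucket"), min_obs=30):
    """Percentiles of a forward-return column within each conditioning bucket."""
    qs = [0.10, 0.25, 0.50, 0.75, 0.90]
    grouped = df.groupby(list(by))[horizon_col]
    table = grouped.quantile(qs).unstack()       # one column per quantile
    table.columns = ["p10", "p25", "p50", "p75", "p90"]
    table["n_obs"] = grouped.size()
    # Drop buckets with too little history to say anything stable.
    return table[table["n_obs"] >= min_obs]
```

Two cautions apply. The last `horizon` rows of each ticker have no forward return and should be excluded rather than filled. Overlapping forward windows are also not independent, so n_obs overstates the effective sample size at the 20-day and 60-day horizons.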
Quantile Regression
Then move to quantile regression, which predicts:
- p10
- p25
- p50
- p75
- p90
This is much better aligned with real market uncertainty than a single target.
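Quantile models are trained and scored with the pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically. The sketch below shows the loss itself rather than a full model; the function name pinball_loss is an assumption, and scikit-learn provides an equivalent metric (mean_pinball_loss) along with a linear QuantileRegressor.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball loss for quantile level q (0 < q < 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q per unit;
    # over-prediction (diff < 0) costs (1 - q) per unit.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Among constant predictions, the loss is minimized at the empirical quantile.
y = np.arange(101, dtype=float)                  # values 0, 1, ..., 100
candidates = np.arange(101, dtype=float)
losses = [pinball_loss(y, np.full_like(y, c), q=0.9) for c in candidates]
best_constant = candidates[int(np.argmin(losses))]
```

The same loss doubles as a calibration check: a well-calibrated p90 forecast should be exceeded by realized returns roughly 10% of the time.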
Baseline Models Before Deep Models
Before using a GPU-heavy architecture, build:
- linear / ridge baseline
- tree-based baseline
- quantile regression baseline
Why:
- they are fast
- they are interpretable
- they set a standard the GPU model must beat
When To Use the GPU
The DGX should be used deliberately.
Good uses:
- multi-horizon quantile neural nets
- sequence models if temporal dependencies prove helpful
- batched inference across many tickers
Bad early use:
- using a GPU because it feels advanced
- replacing simple methods before measuring them
The right rule is:
use the GPU when it buys better calibration, better performance, or better research throughput.
Why Not Start With CUDA / C++
For this project phase, Python is the right language.
Use:
- pandas
- numpy
- scikit-learn
- networkx or graph tooling if needed
- PyTorch where deep models are justified
Avoid starting with CUDA/C++ because:
- the main risk is not compute speed
- the main risk is wrong problem framing
- the main bottleneck is feature and target design
- low-level optimization will slow learning when the research loop is still changing
That said, CUDA can be a very good second-stage learning project.
Good later CUDA experiments:
- rolling-window feature kernels
- pairwise correlation acceleration
- Monte Carlo path simulation
- large-batch inference kernels
That is a much better learning path than rewriting the first version in C++.
A Day-By-Day Build Strategy
Day 7
Freeze architecture, schemas, and container boundaries.
Day 8
Build the feature store and validate it by hand.
Day 9
Fit regime models and inspect historical transitions.
Day 10
Build cluster and market topology views.
Day 11
Detect outliers and create the review queue.
Day 12
Build baseline scenario models and quantile ranges.
Day 13
Add GPU-enabled scenario models only if baselines are stable.
Day 14
Build daily reporting and research output.
This order matters because it keeps the system explainable at every step.
Dockerized Pipeline Design
This project should remain one-job-per-container.
CPU Containers
- feature generation
- regime detection
- clustering
- outlier detection
- report generation
These jobs are parallelizable and should scale across CPU cores.
GPU Containers
- scenario model training
- batched scenario inference
These should be isolated so the NVIDIA runtime is only used where it adds value.
Shared Storage
Use:
- PostgreSQL for structured outputs
- named volume for model artifacts
- optional report artifact volume for markdown/PDF outputs
This keeps the pipeline reproducible and portable.
What Success Looks Like
The project is succeeding when you can open a daily report and quickly see:
- the current regime at multiple horizons
- whether the market is broadening or narrowing
- which clusters are strongest
- which names are unusual
- what the scenario bands say for those names
- which outliers should be checked against news
At that point, the system is doing what it should:
- it is not pretending to know the future exactly
- it is narrowing attention intelligently
- it is teaching you how the market is structured
- it is teaching you how the models behave
That is the right foundation.
Final Recommendation
Build this as a market intelligence system first and a forecasting system second.
That order will make the project:
- more robust
- easier to debug
- more educational
- more useful in practice
The most important discipline is this:
do not let the desire for a sophisticated model outrun the quality of the problem definition.
If the system can reliably identify regime changes, cluster structure, outliers, and scenario ranges, then later additions like news scoring, LLM summaries, or deeper GPU models will have a much stronger foundation.