From Daily Bars to Market Regime Intelligence on a DGX

Introduction

Most trading systems start too late in the pipeline.

They begin with a model and ask, "What should we predict?" before they can clearly answer:

what market state are we in
which names are moving together
which names are behaving abnormally
whether a move is part of a broad regime or a stock-specific event

For a Nasdaq-100 universe with 20 years of daily OHLCV data, a better first project is not "predict tomorrow's close." A better first project is to build market regime intelligence.

That means building a system that can:

detect bullish, bearish, transitional, and high-volatility states
group stocks by shared behavior
surface names that detach from their peers
estimate ranges of outcomes instead of overconfident point predictions
send the most interesting anomalies to a human or news layer for review

This document lays out that design for dgx-trading-system.

The aim is twofold:

improve market understanding
improve model understanding

Those are different goals, and the system should support both.

The Core Idea

The market is not one environment. It shifts.

Some periods are:

broad bullish trends with tight leadership
broad bullish trends with low volatility
narrow rallies led by a few megacaps
panic regimes with rising correlation
mean-reverting churn
post-shock recovery transitions

Trying to fit one model across all of these without first identifying the state usually produces fragile conclusions.

So the project should work in layers:

detect regimes
find correlated groups inside each regime
detect outliers relative to the group
estimate the range of forward outcomes given the setup
cross-check outliers against market news and events

This layered design is more useful than a single "BUY/SELL" model because each layer produces an output that can be inspected and validated on its own.

Why Regimes Matter

A stock can behave very differently depending on the environment.

For example:

In a broad risk-on regime, semiconductors may move as a cluster and respond to momentum.
In a panic regime, correlation may spike and stock-specific factors matter less.
In a post-earnings dispersion regime, stock-specific news can dominate sector behavior.

If you do not condition on regime, you mix together incompatible behaviors and pollute both your features and your labels.

Regime labeling helps answer:

Is the current move broad or narrow?
Is correlation rising or breaking apart?
Are leaders stable or rotating?
Should a stock's move be judged relative to the market, sector, or its own cluster?

That is why regime detection belongs near the front of the pipeline.

Why Correlated Groups Matter

Raw correlation is informative but incomplete.

If two names are highly correlated, that alone does not tell you:

whether they are part of a stable cluster
whether their relationship is regime-specific
whether one of them is currently behaving unusually

Clustering and graph-style thinking help because the market is not just a list of tickers. It is a network.

In a good market intelligence system, you do not just ask:

"Is NVDA up?"

You ask:

Is NVDA strong relative to semis?
Is the semiconductor cluster itself strong?
Is the cluster central to the whole market right now?
Is NVDA acting like its cluster, or detaching from it?

That shift in framing is where a lot of useful intelligence comes from.

Why Outliers Matter

Average behavior is often less interesting than abnormal behavior.

Suppose most of a cluster is flat, but one stock is surging on volume, breaking its normal correlation profile, and moving to the edge of the feature distribution. That is often where the most actionable signal is.

Outliers can mean:

company-specific news
earnings surprises
product launches
analyst revisions
policy or regulatory events
unusual positioning or forced flows
early leadership rotation

The correct response to an outlier is not immediate trust. The correct response is investigation.

That is why the system should send strong outliers into a review queue for news cross-checking.

Why Range-of-Outcomes Beats Point Prediction

Markets are noisy, and next-day direction is a weak target on its own.

A point forecast like "tomorrow return = 0.62%" sounds precise, but it usually hides the uncertainty that matters most.

For practical research, it is usually more useful to know:

downside tail
upside tail
median expected path
uncertainty width under current regime
whether historical outcomes were stable or chaotic under similar conditions

That is why the system should output scenario ranges such as:

10th percentile return
25th percentile return
median return
75th percentile return
90th percentile return

This is both more honest and more useful.

Project Architecture

The project should remain modular and Dockerized.

Existing Foundation

dgx-trading-system already has:

PostgreSQL
daily ingestion for a large Nasdaq-100 universe
Dockerized workloads
NVIDIA-capable model containers

That is enough to build the next layer without changing ingestion.

Recommended New Modules

analytics/regime/
  features.py
  market_features.py
  regime_hmm.py
  change_points.py
  pca_factors.py
  correlations.py
  clusters.py
  outliers.py
  run_features.py
  run_regimes.py
  run_clusters.py
  run_outliers.py
  summarize.py

model/outcomes/
  dataset.py
  targets.py
  baseline_models.py
  quantile_model.py
  montecarlo.py
  train.py
  train_universe.py
  predict.py
  predict_universe.py

This preserves the existing separation:

ingestion stays ingestion
analytics stays analytics
models stay models
reporting stays reporting

That is a healthy design for both production and learning.

The Data Model

The system should not dump everything into one table.

Instead, it should separate:

raw prices
derived features
regime labels
relationship graphs / clusters
outlier events
forecast ranges
review metadata

That gives you traceability.

Key Tables

`regime_feature_daily`

Stores ticker-level engineered features. This is the substrate used by regime, cluster, outlier, and scenario models.

`market_regime_state`

Stores market-level hidden states by date and horizon. This lets you ask:

what regime were we in on this date?
did it change recently?
how confident was the regime model?

`correlation_cluster_daily`

Stores cluster membership by horizon and date. This turns raw correlation into a usable structure.

`outlier_event_daily`

Stores names behaving abnormally relative to their current cluster or own history.

`scenario_forecast_daily`

Stores range-based forward views instead of only a directional guess.

`news_review_queue`

Stores items for manual or later automated event review.

This table is where quant research hands off to qualitative research.

Feature Engineering

Feature engineering should start with transparent, interpretable signals.

That matters because if later models behave strangely, you need to know whether the problem came from:

the raw data
the feature logic
the regime model
the clustering step
the forecasting model

Interpretable features make that possible.

Recommended Row-Level Features

1-day log return
5-day log return
20-day log return
realized volatility over 20 and 60 days
ATR(14)
RSI(14)
MACD histogram
Bollinger %B
volume ratio versus 20-day average
high-low range normalized by close
close-to-open return
5-day rate of change
distance from SMA20 and SMA60
beta to QQQ over 60 days
correlation to QQQ over 60 days
relative strength versus QQQ
rolling max drawdown
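
A minimal sketch of how a few of these features could be computed with pandas for a single ticker. The column names (`open`, `high`, `low`, `close`, `volume`), the function names, and the 252-day annualization factor are assumptions for illustration, not the project's actual schema. ATR, RSI, MACD, and Bollinger %B are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

def row_level_features(bars: pd.DataFrame) -> pd.DataFrame:
    """Compute a few interpretable features for ONE ticker.

    `bars` is assumed to be indexed by date, sorted ascending, with
    columns open, high, low, close, volume (hypothetical schema).
    """
    out = pd.DataFrame(index=bars.index)
    log_close = np.log(bars["close"])
    out["ret_1d"] = log_close.diff(1)    # 1-day log return
    out["ret_5d"] = log_close.diff(5)    # 5-day log return
    out["ret_20d"] = log_close.diff(20)  # 20-day log return
    # Realized volatility: rolling std of daily log returns, annualized
    out["rvol_20d"] = out["ret_1d"].rolling(20).std() * np.sqrt(252)
    # Volume ratio versus the 20-day average
    out["vol_ratio_20d"] = bars["volume"] / bars["volume"].rolling(20).mean()
    # High-low range normalized by close
    out["hl_range"] = (bars["high"] - bars["low"]) / bars["close"]
    # Distance from SMA20, as a fraction of the average
    sma20 = bars["close"].rolling(20).mean()
    out["dist_sma20"] = bars["close"] / sma20 - 1.0
    return out

def rolling_beta(stock_ret: pd.Series, bench_ret: pd.Series, window: int = 60) -> pd.Series:
    """Rolling beta of a stock to a benchmark such as QQQ: cov / var."""
    cov = stock_ret.rolling(window).cov(bench_ret)
    var = bench_ret.rolling(window).var()
    return cov / var
```

Every column here can be checked by hand against a spreadsheet, which is the point of starting with transparent features.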

Recommended Market-Level Features

Aggregate universe signals such as:

fraction of tickers above SMA20 / SMA60 / SMA200
average cross-sectional return
average realized volatility
median correlation
dispersion of returns
top-decile versus bottom-decile spread
share of positive 20-day momentum names

These features are more useful for regime detection than looking at QQQ alone.
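
As a rough illustration, breadth and dispersion signals like these can be derived from a wide table of closing prices. The dates-by-tickers layout, the function name, and the output column names are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def market_features(close: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Aggregate universe signals from a dates x tickers frame of closes.

    The wide layout (one column per ticker) is an assumption of this sketch.
    """
    log_close = np.log(close)
    ret = log_close.diff()
    sma = close.rolling(window).mean()
    out = pd.DataFrame(index=close.index)
    # Breadth: fraction of tickers closing above their own moving average.
    # Cells without a valid SMA yet are masked so they do not count as "below".
    above = (close > sma).astype(float).where(sma.notna())
    out["frac_above_sma"] = above.mean(axis=1)
    # Average cross-sectional return and its dispersion on each date
    out["avg_ret"] = ret.mean(axis=1)
    out["dispersion"] = ret.std(axis=1)
    # Share of names with positive momentum over the window
    mom = log_close.diff(window)
    out["share_pos_mom"] = (mom > 0).astype(float).where(mom.notna()).mean(axis=1)
    return out
```

Calling it with `window=20`, `60`, and `200` would give the three breadth horizons listed above.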

Regime Detection Theory

There is no single perfect regime algorithm, so the system should use a small stack of complementary methods.

Hidden Markov Models

HMMs are a strong first choice because they model:

latent states
transition persistence
observation noise

For example, the system might learn states like:

low-volatility bullish trend
high-volatility bullish recovery
defensive or bearish regime
transition regime

The exact names come after fitting. The data determines the hidden states; the human interprets them.
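
To make the three ingredients concrete, here is a small numpy sketch of the HMM forward filter for one-dimensional Gaussian observations. It assumes the parameters are already known. The two states and every number below are made up for illustration; in practice the parameters would be fitted from the market-level features, for example with an HMM library.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normal density, evaluated element-wise for each state's mean and std."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def forward_filter(obs, start_prob, trans, means, stds):
    """Return P(state_t | obs_1..obs_t) for a Gaussian HMM with known parameters."""
    means, stds = np.asarray(means, dtype=float), np.asarray(stds, dtype=float)
    trans = np.asarray(trans, dtype=float)
    n, k = len(obs), len(start_prob)
    post = np.zeros((n, k))
    belief = np.asarray(start_prob, dtype=float)
    for t in range(n):
        if t > 0:
            belief = post[t - 1] @ trans          # predict: transition persistence
        like = gaussian_pdf(obs[t], means, stds)  # observation noise, per state
        unnorm = belief * like
        post[t] = unnorm / unnorm.sum()           # update and normalize
    return post

# Hypothetical two-state setup: state 0 = calm, state 1 = stressed.
START = np.array([0.5, 0.5])
TRANS = np.array([[0.98, 0.02],   # calm tends to stay calm
                  [0.05, 0.95]])  # stressed tends to stay stressed
MEANS = np.array([0.0005, -0.0010])
STDS = np.array([0.007, 0.025])

# Simulated daily returns: 150 calm days followed by 150 stressed days
rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(MEANS[0], STDS[0], 150),
                      rng.normal(MEANS[1], STDS[1], 150)])
post = forward_filter(obs, START, TRANS, MEANS, STDS)
print("avg P(stressed), first half :", post[:150, 1].mean())
print("avg P(stressed), second half:", post[150:, 1].mean())
```

The posterior probabilities are the kind of confidence value that could be stored alongside the regime label.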

Change-Point Detection

HMMs are good at persistent states, but they do not always mark boundaries cleanly.

Change-point detection complements them by identifying moments where structure shifts:

a volatility break
a trend break
a sudden change in cross-sectional dispersion
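
One simple way to locate such a break is a least-squares mean-shift split, which is the building block of binary segmentation. The sketch below finds a single split point; the function name and the `min_size` guard are assumptions, and dedicated change-point libraries offer more robust methods.

```python
import numpy as np

def best_split(x, min_size=20):
    """Find the index that best splits x into two segments with different means.

    Minimizes the total within-segment sum of squared deviations.
    Returns (index, cost_reduction), where index is the first position of the
    second segment, or (None, 0.0) if the series is too short to split.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    total_cost = float(np.sum((x - x.mean()) ** 2))
    best_idx, best_cost = None, total_cost
    for i in range(min_size, n - min_size + 1):
        left, right = x[:i], x[i:]
        cost = float(np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2))
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx, total_cost - best_cost
```

Applied to absolute or squared daily returns, the same split looks for a volatility break; applied to cross-sectional dispersion, it looks for a dispersion break. Repeating the split on each resulting segment gives multiple change points.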

This is helpful when you want to answer:

"When did the new regime begin?"

Rolling PCA

Rolling PCA is not a regime model by itself, but it is very useful for diagnosing structure.

It tells you:

whether one factor dominates the market
whether leadership is broad or narrow
whether the factor structure is stable or changing

That makes it excellent for interpreting "bullish" conditions:

broad bullish leadership
concentrated bullish leadership
unstable rally with rising dispersion

Those are different environments, and rolling PCA helps reveal that.
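
A minimal sketch of one such diagnostic: the share of variance explained by the first principal component of a rolling correlation matrix. It assumes a dates-by-tickers frame of returns with no missing values; the function name and the 60-day window are illustrative choices.

```python
import numpy as np
import pandas as pd

def first_pc_share(returns: pd.DataFrame, window: int = 60) -> pd.Series:
    """Rolling share of variance explained by the first principal component.

    Uses the correlation matrix of each window, so every ticker has equal weight.
    Assumes `returns` has no NaNs inside the windows.
    """
    values = returns.to_numpy()
    shares = np.full(len(returns), np.nan)
    for end in range(window, len(returns) + 1):
        block = values[end - window:end]
        corr = np.corrcoef(block, rowvar=False)  # tickers are columns
        eigvals = np.linalg.eigvalsh(corr)       # eigenvalues in ascending order
        shares[end - 1] = eigvals[-1] / eigvals.sum()
    return pd.Series(shares, index=returns.index, name="pc1_share")
```

A share near 1 means a single factor is driving almost everything; a lower or falling share points to more dispersed, stock-specific behavior.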

Clustering and Market Topology

Correlation matrices are useful, but cluster structure is often more actionable than the raw matrix.

Recommended Starting Method

Hierarchical clustering on rolling correlation distance:

distance(a, b) = sqrt(0.5 * (1 - corr(a, b)))

Why start here:

interpretable
stable enough for daily research
good for visual diagnostics
easy to compare across horizons
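
The distance above maps a correlation of 1 to 0 and a correlation of -1 to 1. A short sketch of clustering on it with SciPy follows. The average-linkage choice, the fixed number of clusters, and the function name are assumptions for illustration rather than settings prescribed by this design.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_clusters(returns: pd.DataFrame, n_clusters: int) -> pd.Series:
    """Assign each ticker (column of `returns`) to one of n_clusters groups."""
    corr = returns.corr().to_numpy()
    # distance(a, b) = sqrt(0.5 * (1 - corr(a, b))); clip guards against float error
    dist = np.sqrt(np.clip(0.5 * (1.0 - corr), 0.0, 1.0))
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # upper triangle as a flat vector
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=returns.columns, name="cluster")
```

Running this on rolling windows of different lengths gives cluster membership per date and horizon. Label numbers are arbitrary from run to run, so persistence should be compared by membership rather than by label.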

What To Look For

which clusters persist across windows
which clusters change sharply across regime transitions
whether leadership is dominated by one cluster
whether a stock jumps between clusters frequently

Cluster instability can itself be useful information.

Outlier Detection Theory

There is no single kind of outlier.

There are different kinds of unusual behavior, and each suggests a different story.

1. Return Outliers

These are names whose returns are abnormally strong or weak relative to their cluster peers.

Use:

robust z-score
median absolute deviation

This is simple and useful.
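
A sketch of the robust z-score computed within each cluster. The 1.4826 factor scales the median absolute deviation so it matches the standard deviation of a normal distribution. The 3.5 threshold and the function names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def robust_zscore(values: pd.Series) -> pd.Series:
    """(x - median) / (1.4826 * MAD); NaN if the group has no spread."""
    median = values.median()
    mad = (values - median).abs().median()
    if mad == 0:
        return pd.Series(np.nan, index=values.index)
    return (values - median) / (1.4826 * mad)

def cluster_return_outliers(returns: pd.Series, clusters: pd.Series,
                            threshold: float = 3.5) -> pd.Series:
    """Robust z-scores of names that are extreme relative to their own cluster.

    `returns` and `clusters` are both indexed by ticker for a single date.
    """
    z = returns.groupby(clusters).transform(robust_zscore)
    return z[z.abs() > threshold]
```

Because the median and MAD barely move when one name jumps, the outlier does not mask itself the way it can with a mean and standard deviation.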

2. Correlation Outliers

These are names whose co-movement structure breaks down.

Examples:

a stock normally moves with semiconductors but suddenly disconnects
a cluster leader stops behaving like the cluster
a defensive name starts trading like a momentum name

This kind of outlier is often particularly valuable.

3. Feature-Space Outliers

These are names whose entire behavior profile becomes unusual.

Use:

Mahalanobis distance
later, maybe Isolation Forest

This captures multi-dimensional abnormality better than one indicator can.
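
A minimal numpy sketch of the Mahalanobis distance of each name from the cross-sectional mean of the feature matrix. It uses the plain sample covariance, which extreme rows can inflate; a robust covariance estimate is a possible refinement. The function name is an assumption.

```python
import numpy as np

def mahalanobis_distances(features) -> np.ndarray:
    """Distance of each row from the mean, in covariance-scaled units.

    `features` is a (names x features) array with no missing values.
    """
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates near-collinear features
    # Quadratic form centered_i^T * inv_cov * centered_i for every row i
    sq = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(sq, 0.0))
```

Features on very different scales need no separate standardization here, because the covariance matrix already accounts for scale and for correlation between features.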

Why Outliers Need News Review

An outlier may be:

a real event
a data problem
an artifact of an unstable cluster
a temporary flow-driven move

That is why the correct pattern is:

quantify the anomaly
rank the severity
send it for review
tag the driver if confirmed

That process turns raw anomaly detection into intelligence.
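
The rank-and-queue steps could look something like the sketch below. The column names (`ticker`, `score`, `status`, `driver_tag`), the threshold, and the function name are hypothetical and are not the actual `news_review_queue` schema.

```python
import pandas as pd

def build_review_queue(events: pd.DataFrame, top_n: int = 10,
                       min_score: float = 3.5) -> pd.DataFrame:
    """Rank anomalies by severity and keep the strongest for review.

    `events` is assumed to have one row per anomaly with a signed `score`
    column (for example a robust z-score).
    """
    strong = events[events["score"].abs() >= min_score].copy()
    strong["severity"] = strong["score"].abs()  # rank by magnitude, not sign
    strong = strong.sort_values("severity", ascending=False).head(top_n)
    strong["status"] = "pending"   # awaiting manual or automated review
    strong["driver_tag"] = None    # filled in only once a driver is confirmed
    return strong.reset_index(drop=True)
```

Capping the queue at `top_n` keeps the daily review load bounded.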

Outcome Modeling Theory

Once the system knows:

current regime
current cluster
outlier status
current feature state

it can estimate what typically happened next under similar conditions.

Best First Step: Conditional Historical Distributions

This is often the highest-value first model.

Condition on:

regime
cluster
volatility bucket
relative strength bucket

Then compute forward returns at:

5 days
20 days
60 days

This alone can produce useful scenario tables.
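
A sketch of how such a table could be built with pandas. The conditioning column names, the `min_obs` cutoff, and the function names are assumptions. Forward returns look into the future by construction, so they should only ever be used as targets, never as features.

```python
import numpy as np
import pandas as pd

QUANTILES = [0.10, 0.25, 0.50, 0.75, 0.90]

def forward_log_return(close: pd.Series, horizon: int) -> pd.Series:
    """Log return from date t to t + horizon, aligned to date t.

    The last `horizon` rows are NaN because their future is not yet known.
    """
    log_close = np.log(close)
    return log_close.shift(-horizon) - log_close

def scenario_table(df: pd.DataFrame, keys: list, target: str,
                   min_obs: int = 30) -> pd.DataFrame:
    """Empirical quantiles of `target` for each combination of `keys`.

    Buckets with fewer than `min_obs` historical observations are dropped,
    since their percentiles would be unreliable.
    """
    grouped = df.dropna(subset=[target]).groupby(keys)[target]
    table = grouped.quantile(QUANTILES).unstack()
    table.columns = [f"p{int(round(q * 100))}" for q in table.columns]
    table["n_obs"] = grouped.size()
    return table[table["n_obs"] >= min_obs]
```

For example, `scenario_table(history, ["regime", "vol_bucket"], "fwd_20d")` would give one row per regime and volatility bucket with its p10 to p90 band, where `history` and its columns are placeholders for the joined feature and label data.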

Quantile Regression

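Then move to quantile regression, which predicts conditional quantiles of the forward return rather than a single expected value:

10th percentile return
25th percentile return
median return
75th percentile return
90th percentile return

This is much better aligned with real market uncertainty than a single target.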
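
Quantile regression is trained by minimizing the pinball (quantile) loss instead of squared error. The numpy sketch below shows that loss and checks the key property behind it: the constant forecast that minimizes pinball loss at level q is the empirical q-quantile. The function names are illustrative. A real model would replace the constant with a function of the features, for example via scikit-learn's quantile-capable estimators.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def best_constant(y, q):
    """Constant forecast, chosen among observed values, that minimizes pinball loss."""
    candidates = np.sort(np.asarray(y, dtype=float))
    losses = [pinball_loss(y, c, q) for c in candidates]
    return float(candidates[int(np.argmin(losses))])

# Simulated 20-day returns, purely for illustration
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 0.02, 1001)
for q in (0.10, 0.50, 0.90):
    print(q, best_constant(sample, q), np.quantile(sample, q))
```

The asymmetry is what makes the loss useful for scenario bands: at q = 0.90 an under-prediction costs nine times as much as an over-prediction of the same size, which pushes the fitted value toward the upper tail.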

Baseline Models Before Deep Models

Before using a GPU-heavy architecture, build:

linear / ridge baseline
tree-based baseline
quantile regression baseline

Why:

they are fast
they are interpretable
they set a standard the GPU model must beat

When To Use the GPU

The DGX should be used deliberately.

Good uses:

multi-horizon quantile neural nets
sequence models if temporal dependencies prove helpful
batched inference across many tickers

Bad early use:

using a GPU because it feels advanced
replacing simple methods before measuring them

The right rule is:

use the GPU when it buys better calibration, better performance, or better research throughput.

Why Not Start With CUDA / C++

For this project phase, Python is the right language.

Use:

pandas
numpy
scikit-learn
networkx or graph tooling if needed
PyTorch where deep models are justified

Avoid starting with CUDA/C++ because:

the main risk is not compute speed
the main risk is wrong problem framing
the main bottleneck is feature and target design
low-level optimization will slow learning when the research loop is still changing

That said, CUDA can be a very good second-stage learning project.

Good later CUDA experiments:

rolling-window feature kernels
pairwise correlation acceleration
Monte Carlo path simulation
large-batch inference kernels

That is a much better learning path than rewriting the first version in C++.

A Day-By-Day Build Strategy

Day 7

Freeze architecture, schemas, and container boundaries.

Day 8

Build the feature store and validate it by hand.

Day 9

Fit regime models and inspect historical transitions.

Day 10

Build cluster and market topology views.

Day 11

Detect outliers and create the review queue.

Day 12

Build baseline scenario models and quantile ranges.

Day 13

Add GPU-enabled scenario models only if baselines are stable.

Day 14

Build daily reporting and research output.

This order matters because it keeps the system explainable at every step.

Dockerized Pipeline Design

This project should remain one-job-per-container.

CPU Containers

feature generation
regime detection
clustering
outlier detection
report generation

These jobs are parallelizable and should scale across CPU cores.

GPU Containers

scenario model training
batched scenario inference

These should be isolated so the NVIDIA runtime is only used where it adds value.

Shared Storage

Use:

PostgreSQL for structured outputs
named volume for model artifacts
optional report artifact volume for markdown/PDF outputs

This keeps the pipeline reproducible and portable.

What Success Looks Like

The project is succeeding when you can open a daily report and quickly see:

the current regime at multiple horizons
whether the market is broadening or narrowing
which clusters are strongest
which names are unusual
what the scenario bands say for those names
which outliers should be checked against news

At that point, the system is doing what it should:

it is not pretending to know the future exactly
it is narrowing attention intelligently
it is teaching you how the market is structured
it is teaching you how the models behave

That is the right foundation.

Final Recommendation

Build this as a market intelligence system first and a forecasting system second.

That order will make the project:

more robust
easier to debug
more educational
more useful in practice

The most important discipline is this:

do not let the desire for a sophisticated model outrun the quality of the problem definition.

If the system can reliably identify regime changes, cluster structure, outliers, and scenario ranges, then later additions like news scoring, LLM summaries, or deeper GPU models will have a much stronger foundation.