Tags: trading, regime, ml, market-analysis, signals

From Daily Bars to Market Regime Intelligence on a DGX

Building a market regime intelligence layer: detecting bull, bear, and sideways markets using unsupervised learning on price features.

December 12, 2025·11 min read

Introduction

Most trading systems start too late in the pipeline.

They begin with a model and ask, "What should we predict?" before they can clearly answer:

  • what market state are we in
  • which names are moving together
  • which names are behaving abnormally
  • whether a move is part of a broad regime or a stock-specific event

For a Nasdaq-100 universe with 20 years of daily OHLCV data, a better first project is not "predict tomorrow's close." A better first project is to build market regime intelligence.

That means building a system that can:

  • detect bullish, bearish, transitional, and high-volatility states
  • group stocks by shared behavior
  • surface names that detach from their peers
  • estimate ranges of outcomes instead of overconfident point predictions
  • send the most interesting anomalies to a human or news layer for review

This document lays out that design for dgx-trading-system.

The aim is twofold:

  1. improve market understanding
  2. improve model understanding

Those are different goals, and the system should support both.


The Core Idea

The market is not one environment. It shifts.

Some periods are:

  • broad bullish trends with tight leadership
  • broad bullish trends with low volatility
  • narrow rallies led by a few megacaps
  • panic regimes with rising correlation
  • mean-reverting churn
  • post-shock recovery transitions

Trying to fit one model across all of these without first identifying the state usually produces fragile conclusions.

So the project should work in layers:

  1. detect regimes
  2. find correlated groups inside each regime
  3. detect outliers relative to the group
  4. estimate the range of forward outcomes given the setup
  5. cross-check outliers against market news and events

This layered design is more useful than a single "BUY/SELL" model because each layer can be inspected and validated on its own.


Why Regimes Matter

A stock can behave very differently depending on the environment.

For example:

  • In a broad risk-on regime, semiconductors may move as a cluster and respond to momentum.
  • In a panic regime, correlation may spike and stock-specific factors matter less.
  • In a post-earnings dispersion regime, stock-specific news can dominate sector behavior.

If you do not condition on regime, you mix together incompatible behaviors and pollute both your features and your labels.

Regime labeling helps answer:

  • Is the current move broad or narrow?
  • Is correlation rising or breaking apart?
  • Are leaders stable or rotating?
  • Should a stock's move be judged relative to the market, sector, or its own cluster?

That is why regime detection belongs near the front of the pipeline.


Why Correlated Groups Matter

Raw correlation is informative but incomplete.

If two names are highly correlated, that alone does not tell you:

  • whether they are part of a stable cluster
  • whether their relationship is regime-specific
  • whether one of them is currently behaving unusually

Clustering and graph-style thinking help because the market is not just a list of tickers. It is a network.

In a good market intelligence system, you do not just ask:

"Is NVDA up?"

You ask:

  • Is NVDA strong relative to semis?
  • Is the semiconductor cluster itself strong?
  • Is the cluster central to the whole market right now?
  • Is NVDA acting like its cluster, or detaching from it?

That shift in framing is where a lot of useful intelligence comes from.


Why Outliers Matter

Average behavior is often less interesting than abnormal behavior.

Suppose most of a cluster is flat, but one stock is surging on volume, breaking its normal correlation profile, and moving to the edge of the feature distribution. That is often where the most actionable signal is.

Outliers can mean:

  • company-specific news
  • earnings surprises
  • product launches
  • analyst revisions
  • policy or regulatory events
  • unusual positioning or forced flows
  • early leadership rotation

The correct response to an outlier is not immediate trust. The correct response is investigation.

That is why the system should send strong outliers into a review queue for news cross-checking.


Why Range-of-Outcomes Beats Point Prediction

Markets are noisy, and next-day direction is a weak target on its own.

A point forecast like "tomorrow return = 0.62%" sounds precise, but it usually hides the uncertainty that matters most.

For practical research, it is usually more useful to know:

  • downside tail
  • upside tail
  • median expected path
  • uncertainty width under current regime
  • whether historical outcomes were stable or chaotic under similar conditions

That is why the system should output scenario ranges such as:

  • 10th percentile return
  • 25th percentile return
  • median return
  • 75th percentile return
  • 90th percentile return

This is both more honest and more useful.


Project Architecture

The project should remain modular and Dockerized.

Existing Foundation

dgx-trading-system already has:

  • PostgreSQL
  • daily ingestion for the Nasdaq-100 universe
  • Dockerized workloads
  • NVIDIA-capable model containers

That is enough to build the next layer without changing ingestion.

The new layer adds two groups of modules:

analytics/regime/
  features.py
  market_features.py
  regime_hmm.py
  change_points.py
  pca_factors.py
  correlations.py
  clusters.py
  outliers.py
  run_features.py
  run_regimes.py
  run_clusters.py
  run_outliers.py
  summarize.py

model/outcomes/
  dataset.py
  targets.py
  baseline_models.py
  quantile_model.py
  montecarlo.py
  train.py
  train_universe.py
  predict.py
  predict_universe.py

This preserves the existing separation:

  • ingestion stays ingestion
  • analytics stays analytics
  • models stay models
  • reporting stays reporting

That is a healthy design for both production and learning.


The Data Model

The system should not dump everything into one table.

Instead, it should separate:

  • raw prices
  • derived features
  • regime labels
  • relationship graphs / clusters
  • outlier events
  • forecast ranges
  • review metadata

That gives you traceability.

Key Tables

regime_feature_daily

Stores ticker-level engineered features. This is the substrate used by regime, cluster, outlier, and scenario models.

market_regime_state

Stores market-level hidden states by date and horizon. This lets you ask:

  • what regime were we in on this date?
  • did it change recently?
  • how confident was the regime model?

correlation_cluster_daily

Stores cluster membership by horizon and date. This turns raw correlation into a usable structure.

outlier_event_daily

Stores names behaving abnormally relative to their current cluster or own history.

scenario_forecast_daily

Stores range-based forward views instead of only a directional guess.

news_review_queue

Stores items for manual or later automated event review.

This table is where quant research hands off to qualitative research.


Feature Engineering

Feature engineering should start with transparent, interpretable signals.

That matters because if later models behave strangely, you need to know whether the problem came from:

  • the raw data
  • the feature logic
  • the regime model
  • the clustering step
  • the forecasting model

Interpretable features make that possible.

Ticker-level signals such as:

  • 1-day log return
  • 5-day log return
  • 20-day log return
  • realized volatility over 20 and 60 days
  • ATR(14)
  • RSI(14)
  • MACD histogram
  • Bollinger %B
  • volume ratio versus 20-day average
  • high-low range normalized by close
  • close-to-open return
  • 5-day rate of change
  • distance from SMA20 and SMA60
  • beta to QQQ over 60 days
  • correlation to QQQ over 60 days
  • relative strength versus QQQ
  • rolling max drawdown
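
A few of the signals above can be sketched in pandas. This is a minimal illustration, not the project's features.py: the column names (open, high, low, close, volume), the function name, and the 252-day annualization factor are all assumptions.

```python
import numpy as np
import pandas as pd


def ticker_features(bars: pd.DataFrame) -> pd.DataFrame:
    """Compute a handful of interpretable features for one ticker.

    `bars` is assumed to be indexed by date with columns
    open, high, low, close, volume (hypothetical names).
    """
    close = bars["close"]
    log_close = np.log(close)
    out = pd.DataFrame(index=bars.index)
    # Log returns over 1, 5, and 20 trading days.
    out["ret_1d"] = log_close.diff(1)
    out["ret_5d"] = log_close.diff(5)
    out["ret_20d"] = log_close.diff(20)
    # Realized volatility: rolling std of daily log returns, annualized.
    out["rvol_20d"] = out["ret_1d"].rolling(20).std() * np.sqrt(252)
    # Volume relative to its own 20-day average.
    out["volume_ratio_20d"] = bars["volume"] / bars["volume"].rolling(20).mean()
    # High-low range normalized by close.
    out["range_norm"] = (bars["high"] - bars["low"]) / close
    # Distance from the 20-day simple moving average.
    out["dist_sma20"] = close / close.rolling(20).mean() - 1.0
    return out


# Synthetic bars, only to show the call shape.
rng = np.random.default_rng(0)
n = 120
dates = pd.bdate_range("2024-01-02", periods=n)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, n))), index=dates)
bars = pd.DataFrame({
    "open": close.shift(1).fillna(close.iloc[0]),
    "high": close * 1.01,
    "low": close * 0.99,
    "close": close,
    "volume": rng.integers(1_000_000, 2_000_000, n).astype(float),
})
features = ticker_features(bars)
```

The leading rows are NaN until each rolling window fills, which is worth checking by hand before anything downstream consumes the table.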

Aggregate universe signals such as:

  • fraction of tickers above SMA20 / SMA60 / SMA200
  • average cross-sectional return
  • average realized volatility
  • median correlation
  • dispersion of returns
  • top-decile versus bottom-decile spread
  • share of positive 20-day momentum names

These features are more useful for regime detection than looking at QQQ alone.
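
As a rough sketch, the breadth and dispersion signals can be computed from a wide dates-by-tickers frame of closes. The frame layout, function name, and column names are assumptions, and the decile spread here is approximated as the 90th minus the 10th percentile of 20-day returns rather than a difference of decile averages.

```python
import numpy as np
import pandas as pd


def market_features(close: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional signals from a dates x tickers frame of closes (assumed layout)."""
    ret_1d = np.log(close).diff()
    ret_20d = np.log(close).diff(20)
    sma20 = close.rolling(20).mean()

    out = pd.DataFrame(index=close.index)
    # Breadth: share of tickers above their SMA20, ignoring tickers with no SMA yet.
    above = (close > sma20).astype(float).where(sma20.notna())
    out["frac_above_sma20"] = above.mean(axis=1)
    # Average and dispersion of same-day returns across the universe.
    out["mean_ret_1d"] = ret_1d.mean(axis=1)
    out["dispersion_1d"] = ret_1d.std(axis=1)
    # Share of names with positive 20-day momentum.
    positive = (ret_20d > 0).astype(float).where(ret_20d.notna())
    out["frac_pos_mom_20d"] = positive.mean(axis=1)
    # Simple proxy for the top-versus-bottom decile spread.
    out["decile_spread_20d"] = ret_20d.quantile(0.9, axis=1) - ret_20d.quantile(0.1, axis=1)
    return out


# Synthetic universe of ten hypothetical tickers.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-02", periods=80)
tickers = [f"T{i}" for i in range(10)]
close = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0005, 0.015, (80, 10)), axis=0)),
    index=dates,
    columns=tickers,
)
breadth = market_features(close)
```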


Regime Detection Theory

There is no single perfect regime algorithm, so the system should use a small stack of complementary methods.

Hidden Markov Models

HMMs are a strong first choice because they model:

  • latent states
  • transition persistence
  • observation noise

For example, the system might learn states like:

  • low-volatility bullish trend
  • high-volatility bullish recovery
  • defensive or bearish regime
  • transition regime

The exact names come after fitting. The data determines the hidden states; the human interprets them.
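
To make the idea of latent states with persistence concrete, here is a minimal sketch of Viterbi decoding for a two-state Gaussian HMM in plain numpy. The parameters are hand-picked assumptions for illustration only; in a real run they would be fitted to the data (for example by expectation-maximization), and the observations would be a vector of market features rather than a single return series.

```python
import numpy as np


def viterbi_gaussian(obs, means, stds, trans, start):
    """Most likely state path for a 1-D Gaussian HMM with known parameters."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    n_obs, n_states = len(obs), len(means)
    # Log density of every observation under every state: shape (n_obs, n_states).
    log_emit = (
        -0.5 * ((obs[:, None] - means) / stds) ** 2
        - np.log(stds)
        - 0.5 * np.log(2 * np.pi)
    )
    log_trans = np.log(trans)
    score = np.log(start) + log_emit[0]
    back = np.zeros((n_obs, n_states), dtype=int)
    for t in range(1, n_obs):
        # cand[i, j]: best score ending in state i at t-1, then moving to state j.
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Walk the back-pointers from the best final state.
    path = np.zeros(n_obs, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_obs - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


# Assumed parameters: state 0 = calm, state 1 = stressed.
means = np.array([0.0005, -0.001])
stds = np.array([0.007, 0.025])
trans = np.array([[0.98, 0.02], [0.05, 0.95]])  # sticky transitions
start = np.array([0.5, 0.5])

# Synthetic returns: 150 calm days followed by 50 stressed days.
rng = np.random.default_rng(1)
returns = np.concatenate([
    rng.normal(means[0], stds[0], 150),
    rng.normal(means[1], stds[1], 50),
])
states = viterbi_gaussian(returns, means, stds, trans, start)
```

The high diagonal of the transition matrix is what encodes persistence: a single large return is not enough to flip the decoded state.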

Change-Point Detection

HMMs are good at persistent states, but they do not always mark boundaries cleanly.

Change-point detection complements them by identifying moments where structure shifts:

  • a volatility break
  • a trend break
  • a sudden change in cross-sectional dispersion
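
One simple way to locate a single break of this kind is a least-squares split: try every split point and keep the one that minimizes the squared error around two segment means. The sketch below is one step of binary segmentation, chosen here for illustration; it is not necessarily what change_points.py uses, and the function name and min_size parameter are assumptions. Applying it to absolute returns turns it into a volatility-break detector.

```python
import numpy as np


def best_single_split(x, min_size=10):
    """Index k that best splits x into x[:k] and x[k:] with different means."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    csum = np.cumsum(x)
    csum2 = np.cumsum(x ** 2)
    best_idx, best_cost = None, np.inf
    for k in range(min_size, n - min_size + 1):
        # Squared error of a segment = sum(x^2) - (sum(x))^2 / length.
        left = csum2[k - 1] - csum[k - 1] ** 2 / k
        right = (csum2[-1] - csum2[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
        if left + right < best_cost:
            best_idx, best_cost = k, left + right
    return best_idx


# Synthetic returns whose volatility triples after day 200.
rng = np.random.default_rng(3)
returns = np.concatenate([rng.normal(0, 0.01, 200), rng.normal(0, 0.03, 100)])
split = best_single_split(np.abs(returns))
```

The returned index is an estimate of the boundary date, which can then be compared against where the HMM switches state.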

This is helpful when you want to answer:

"When did the new regime begin?"

Rolling PCA

Rolling PCA is not a regime model by itself, but it is very useful for diagnosing structure.

It tells you:

  • whether one factor dominates the market
  • whether leadership is broad or narrow
  • whether the factor structure is stable or changing

That makes it excellent for interpreting "bullish" conditions:

  • broad bullish leadership
  • concentrated bullish leadership
  • unstable rally with rising dispersion

Those are different environments, and rolling PCA helps reveal that.
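
A minimal sketch of the "does one factor dominate" diagnostic: take the rolling correlation matrix of returns and track the share of total variance carried by its largest eigenvalue. The 60-day window, function name, and input layout (dates by tickers) are assumptions.

```python
import numpy as np
import pandas as pd


def first_factor_share(returns: pd.DataFrame, window: int = 60) -> pd.Series:
    """Share of variance explained by the first principal component, per window."""
    shares = {}
    for end in range(window, len(returns) + 1):
        block = returns.iloc[end - window:end].to_numpy()
        corr = np.corrcoef(block, rowvar=False)
        eigvals = np.linalg.eigvalsh(corr)  # sorted ascending
        shares[returns.index[end - 1]] = eigvals[-1] / eigvals.sum()
    return pd.Series(shares, name="pc1_share")


# Synthetic returns: independent names at first, then one common driver.
rng = np.random.default_rng(5)
dates = pd.bdate_range("2024-01-02", periods=240)
idio = rng.normal(0, 0.01, (240, 8))
common = rng.normal(0, 0.01, (240, 1))
data = idio.copy()
data[120:] = 0.5 * idio[120:] + common[120:]
returns = pd.DataFrame(data, index=dates, columns=[f"T{i}" for i in range(8)])

pc1 = first_factor_share(returns, window=60)
```

A rising share suggests the market is trading as one factor; a falling share suggests dispersion and stock-specific behavior.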


Clustering and Market Topology

Correlation matrices are useful, but cluster structure is often more actionable than the raw matrix.

A good starting point is hierarchical clustering on a rolling correlation distance:

distance(a, b) = sqrt(0.5 * (1 - corr(a, b)))

Why start here:

  • interpretable
  • stable enough for daily research
  • good for visual diagnostics
  • easy to compare across horizons
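
The distance above plugs directly into SciPy's hierarchical clustering. This is a sketch under assumptions: average linkage and a fixed cluster count are illustrative choices, and the function name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def correlation_clusters(returns, n_clusters):
    """Cluster columns of a (days x tickers) return matrix by correlation distance."""
    corr = np.corrcoef(returns, rowvar=False)
    # distance(a, b) = sqrt(0.5 * (1 - corr(a, b))); clip guards against rounding.
    dist = np.sqrt(0.5 * np.clip(1.0 - corr, 0.0, None))
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic universe: two groups of three names, each driven by its own factor.
rng = np.random.default_rng(11)
factor_a = rng.normal(0, 0.01, (250, 1))
factor_b = rng.normal(0, 0.01, (250, 1))
noise = rng.normal(0, 0.004, (250, 6))
returns = np.hstack([
    np.repeat(factor_a, 3, axis=1),
    np.repeat(factor_b, 3, axis=1),
]) + noise

labels = correlation_clusters(returns, n_clusters=2)
```

Running this on each rolling window and comparing labels across windows is one way to measure how stable the clusters are.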

What To Look For

  • which clusters persist across windows
  • which clusters change sharply across regime transitions
  • whether leadership is dominated by one cluster
  • whether a stock jumps between clusters frequently

Cluster instability can itself be useful information.


Outlier Detection Theory

There is no single kind of outlier.

There are different kinds of unusual behavior, and each suggests a different story.

1. Return Outliers

These are names whose returns are abnormally strong or weak relative to their cluster peers.

Use:

  • robust z-score
  • median absolute deviation

This is simple and useful.
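
A minimal sketch of a MAD-based robust z-score applied to one cluster's same-day returns. The function name and the sample values are made up for illustration; the 1.4826 factor rescales the MAD so it is comparable to a standard deviation for normally distributed data.

```python
import numpy as np


def robust_zscore(values):
    """Z-score based on the median and median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # No spread to scale by; return NaN rather than dividing by zero.
        return np.full_like(values, np.nan)
    return (values - median) / (1.4826 * mad)


# Hypothetical one-day returns for six names in the same cluster.
cluster_returns = np.array([0.001, -0.002, 0.003, 0.0, 0.002, 0.045])
z = robust_zscore(cluster_returns)
```

Because the median and MAD barely move when one name jumps, the outlier does not mask itself the way it can with a mean and standard deviation.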

2. Correlation Outliers

These are names whose co-movement structure breaks down.

Examples:

  • a stock normally moves with semiconductors but suddenly disconnects
  • a cluster leader stops behaving like the cluster
  • a defensive name starts trading like a momentum name

This kind of outlier is often particularly valuable.
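
One hedged way to quantify a disconnect is to compare a stock's rolling correlation with its peer-average return against that correlation's own long-run median. The function name, the 60-day window, and the use of an expanding median as the baseline are all assumptions for this sketch.

```python
import numpy as np
import pandas as pd


def correlation_break(stock_ret: pd.Series, peer_ret: pd.Series, window: int = 60) -> pd.Series:
    """Rolling correlation to the peer return minus its long-run median."""
    rolling_corr = stock_ret.rolling(window).corr(peer_ret)
    # Baseline needs at least `window` correlation readings before it is defined.
    baseline = rolling_corr.expanding(min_periods=window).median()
    return rolling_corr - baseline


# Synthetic example: the stock tracks its peers, then decouples for 60 days.
rng = np.random.default_rng(4)
dates = pd.bdate_range("2023-01-02", periods=360)
peer = pd.Series(rng.normal(0, 0.01, 360), index=dates)
stock = peer + rng.normal(0, 0.003, 360)
stock.iloc[300:] = rng.normal(0, 0.01, 60)

gap = correlation_break(stock, peer, window=60)
```

A strongly negative gap flags a name that used to move with its group and no longer does.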

3. Feature-Space Outliers

These are names whose entire behavior profile becomes unusual.

Use:

  • Mahalanobis distance
  • later, maybe Isolation Forest

This captures multi-dimensional abnormality better than one indicator can.
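
As an illustration, Mahalanobis distance can be computed with numpy alone. The sketch below uses two correlated synthetic features and one row that is unremarkable on each feature separately but breaks their usual relationship; the function name is hypothetical, and a pseudo-inverse is used so a near-singular covariance does not fail.

```python
import numpy as np


def mahalanobis_distances(features):
    """Distance of each row from the sample mean, scaled by the sample covariance."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)
    # Quadratic form centered_i^T * inv_cov * centered_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(d2, 0.0))


# 200 ordinary rows with correlation 0.9, plus one row that goes against it.
rng = np.random.default_rng(2)
cov = [[1.0, 0.9], [0.9, 1.0]]
normal_days = rng.multivariate_normal([0.0, 0.0], cov, size=200)
odd_day = np.array([[2.0, -2.0]])
X = np.vstack([normal_days, odd_day])

dist = mahalanobis_distances(X)
```

The odd row sits only two standard deviations out on each axis, yet its distance is large because the two features almost never move in opposite directions.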

Why Outliers Need News Review

An outlier may be:

  • a real event
  • a data problem
  • an artifact of an unstable cluster
  • a temporary flow-driven move

That is why the correct pattern is:

  1. quantify the anomaly
  2. rank the severity
  3. send it for review
  4. tag the driver if confirmed

That process turns raw anomaly detection into intelligence.


Outcome Modeling Theory

Once the system knows:

  • current regime
  • current cluster
  • outlier status
  • current feature state

it can estimate what typically happened next under similar conditions.

Best First Step: Conditional Historical Distributions

This is often the highest-value first model.

Condition on:

  • regime
  • cluster
  • volatility bucket
  • relative strength bucket

Then compute the distribution of forward returns at:

  • 5 days
  • 20 days
  • 60 days

This alone can produce useful scenario tables.
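
A scenario table of this kind is essentially a groupby followed by quantiles. The sketch below assumes a history frame with hypothetical columns regime, vol_bucket, and fwd_ret_20d, and uses synthetic data purely to show the shape of the output.

```python
import numpy as np
import pandas as pd

QUANTILES = [0.10, 0.25, 0.50, 0.75, 0.90]


def scenario_table(history: pd.DataFrame, horizon_col: str, keys: list) -> pd.DataFrame:
    """Empirical forward-return quantiles for each combination of conditioning keys."""
    grouped = history.groupby(keys)[horizon_col]
    table = grouped.quantile(QUANTILES).unstack()
    table.columns = [f"p{int(round(q * 100))}" for q in table.columns]
    # Keep the sample size: thin buckets should not be trusted.
    table["n_obs"] = grouped.size()
    return table


# Synthetic history: stressed regimes get wider forward-return distributions.
rng = np.random.default_rng(7)
n = 2000
regime = rng.choice(["calm", "stressed"], size=n)
vol_bucket = rng.choice(["low", "high"], size=n)
scale = np.where(regime == "calm", 0.03, 0.08)
history = pd.DataFrame({
    "regime": regime,
    "vol_bucket": vol_bucket,
    "fwd_ret_20d": rng.normal(0.01, scale),
})

table = scenario_table(history, "fwd_ret_20d", ["regime", "vol_bucket"])
```

The n_obs column matters as much as the percentiles: adding more conditioning keys shrinks each bucket quickly.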

Quantile Regression

Then move to quantile regression, which predicts:

  • p10
  • p25
  • p50
  • p75
  • p90

This is much better aligned with real market uncertainty than a single target.
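
Quantile models are trained and scored with the pinball (quantile) loss, which penalizes misses asymmetrically. As a small sketch with synthetic returns, the constant prediction that minimizes the pinball loss at q = 0.10 lands on the empirical 10th percentile. The function name and the candidate grid are assumptions.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction costs q per unit, over-prediction costs (1 - q) per unit.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


# Synthetic 20-day forward returns.
rng = np.random.default_rng(9)
y = rng.normal(0.0, 0.05, 5000)

# Search constant predictions on a grid and keep the one with the lowest loss.
candidates = np.linspace(-0.15, 0.15, 301)
losses = [pinball_loss(y, c, q=0.10) for c in candidates]
best_constant = float(candidates[int(np.argmin(losses))])
empirical_p10 = float(np.quantile(y, 0.10))
```

A regression model trained on this loss does the same thing conditionally: it learns how the 10th percentile moves with the features instead of fitting one constant.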

Baseline Models Before Deep Models

Before using a GPU-heavy architecture, build:

  • linear / ridge baseline
  • tree-based baseline
  • quantile regression baseline

Why:

  • they are fast
  • they are interpretable
  • they set a standard the GPU model must beat

When To Use the GPU

The DGX should be used deliberately.

Good uses:

  • multi-horizon quantile neural nets
  • sequence models if temporal dependencies prove helpful
  • batched inference across many tickers

Bad early use:

  • using a GPU because it feels advanced
  • replacing simple methods before measuring them

The right rule is:

use the GPU when it buys better calibration, better performance, or better research throughput.


Why Not Start With CUDA / C++

For this project phase, Python is the right language.

Use:

  • pandas
  • numpy
  • scikit-learn
  • networkx or graph tooling if needed
  • PyTorch where deep models are justified

Avoid starting with CUDA/C++ because:

  • the main risk is not compute speed
  • the main risk is wrong problem framing
  • the main bottleneck is feature and target design
  • low-level optimization will slow learning when the research loop is still changing

That said, CUDA can be a very good second-stage learning project.

Good later CUDA experiments:

  • rolling-window feature kernels
  • pairwise correlation acceleration
  • Monte Carlo path simulation
  • large-batch inference kernels

That is a much better learning path than rewriting the first version in C++.


A Day-By-Day Build Strategy

Day 7

Freeze architecture, schemas, and container boundaries.

Day 8

Build the feature store and validate it by hand.

Day 9

Fit regime models and inspect historical transitions.

Day 10

Build cluster and market topology views.

Day 11

Detect outliers and create the review queue.

Day 12

Build baseline scenario models and quantile ranges.

Day 13

Add GPU-enabled scenario models only if baselines are stable.

Day 14

Build daily reporting and research output.

This order matters because it keeps the system explainable at every step.


Dockerized Pipeline Design

This project should remain one-job-per-container.

CPU Containers

  • feature generation
  • regime detection
  • clustering
  • outlier detection
  • report generation

These are parallelizable and should scale across CPU cores.

GPU Containers

  • scenario model training
  • batched scenario inference

These should be isolated so the NVIDIA runtime is only used where it adds value.

Shared Storage

Use:

  • PostgreSQL for structured outputs
  • named volume for model artifacts
  • optional report artifact volume for markdown/PDF outputs

This keeps the pipeline reproducible and portable.


What Success Looks Like

The project is succeeding when you can open a daily report and quickly see:

  • the current regime at multiple horizons
  • whether the market is broadening or narrowing
  • which clusters are strongest
  • which names are unusual
  • what the scenario bands say for those names
  • which outliers should be checked against news

At that point, the system is doing what it should:

  • it is not pretending to know the future exactly
  • it is narrowing attention intelligently
  • it is teaching you how the market is structured
  • it is teaching you how the models behave

That is the right foundation.


Final Recommendation

Build this as a market intelligence system first and a forecasting system second.

That order will make the project:

  • more robust
  • easier to debug
  • more educational
  • more useful in practice

The most important discipline is this:

do not let the desire for a sophisticated model outrun the quality of the problem definition.

If the system can reliably identify regime changes, cluster structure, outliers, and scenario ranges, then later additions like news scoring, LLM summaries, or deeper GPU models will have a much stronger foundation.