Day 7 Onward — Market Regime Intelligence Roadmap

Day 7 regime intelligence roadmap: planned enhancements to market regime detection, cross-asset signals, and adaptive position sizing.

February 1, 2026·11 min read

Mission

Build a new analytics project inside dgx-trading-system that turns 20 years of Nasdaq-100 daily OHLCV history into:

  • market regime labels across multiple horizons
  • groups of correlated stocks that behave similarly inside each regime
  • outlier detection for names that detach from their group
  • range-of-outcomes forecasts for stocks and clusters
  • a workflow for cross-checking statistical anomalies against market news

This roadmap starts at Day 7 and continues forward from the ingestion work already completed in ingestion/.

The guiding goal is not only to produce forecasts, but also to build a system that helps explain:

  • what regime the market is in
  • which stocks are moving together
  • which names are unusually strong or weak
  • what outcomes historically followed similar conditions

Why This Project Shape

The system should separate concerns cleanly:

  • ingestion/ loads raw market data and stays unchanged
  • analytics/regime/ creates market features, regime labels, clusters, and outlier events
  • model/outcomes/ trains forecasting and scenario models
  • reporting/ produces summaries, blog-ready outputs, and later LLM/news context

This makes the pipeline modular, easier to test, and easier to reason about.


Target Questions

By the end of this phase, the system should answer:

  1. What market regime are we currently in over 30/60/90/180/360-day windows?
  2. Which stocks are tightly linked in the current regime?
  3. Which stocks are behaving abnormally relative to their peers?
  4. What forward return ranges historically followed similar setups?
  5. Which flagged outliers deserve manual or automated news review?

Pipeline

market_data_daily
      |
      v
regime feature engineering (CPU, parallel across tickers / windows)
      |
      +--> market-level regime detection (CPU)
      |
      +--> rolling correlation + clustering (CPU)
      |
      +--> outlier detection (CPU)
      |
      v
scenario feature store
      |
      +--> baseline forecasting models (CPU)
      |
      +--> deep / GPU scenario models (NVIDIA)
      |
      v
scenario forecasts + daily intelligence summary
      |
      v
news cross-check / analyst review

Container Roles

  • ingestion: CPU, existing, unchanged
  • regime-features: CPU, parallel feature computation and DB upserts
  • regime-detect: CPU, HMM / change-point / PCA jobs
  • regime-cluster: CPU, rolling correlations and clustering jobs
  • regime-outliers: CPU, robust outlier detection and event generation
  • scenario-train: GPU-enabled, trains probabilistic outcome models
  • scenario-predict: GPU-enabled, writes forward scenario ranges
  • report-daily: CPU, composes summaries for review and later LLM/news layers

Parallelism Strategy

  • Parallelize per-ticker and per-window feature generation on CPU
  • Parallelize rolling-correlation jobs by window on CPU
  • Keep heavy GPU training controlled and sequential per model family
  • Allow batched GPU inference once artifacts are stable

This gives us real parallel throughput without fighting for the GPU.


Proposed New Project Layout

dgx-trading-system/
  analytics/
    regime/
      Dockerfile
      requirements.txt
      tickers.py
      db.py
      windows.py
      features.py
      market_features.py
      regime_hmm.py
      change_points.py
      pca_factors.py
      correlations.py
      clusters.py
      outliers.py
      run_features.py
      run_regimes.py
      run_clusters.py
      run_outliers.py
      summarize.py
      schema.sql
  model/
    outcomes/
      Dockerfile
      requirements.txt
      dataset.py
      targets.py
      baseline_models.py
      quantile_model.py
      montecarlo.py
      train.py
      train_universe.py
      predict.py
      predict_universe.py
      schema.sql
  reporting/
    market_intelligence/
      Dockerfile
      requirements.txt
      compose_summary.py
      render_markdown.py
  docs/
    day-07-regime-intelligence-roadmap.md
    market-regime-intelligence-blog.md

Database Design

These tables keep raw data, derived analytics, forecasts, and review metadata separate. That separation matters for debugging and for learning how each model behaves.

1. regime_feature_daily

One row per (ticker, date), holding row-level engineered features.

CREATE TABLE IF NOT EXISTS regime_feature_daily (
    ticker              TEXT        NOT NULL,
    date                DATE        NOT NULL,
    log_return_1d       FLOAT,
    log_return_5d       FLOAT,
    log_return_20d      FLOAT,
    realized_vol_20d    FLOAT,
    realized_vol_60d    FLOAT,
    atr_14              FLOAT,
    rsi_14              FLOAT,
    macd_hist           FLOAT,
    bb_pct_b            FLOAT,
    volume_ratio_20d    FLOAT,
    hl_range            FLOAT,
    close_to_open       FLOAT,
    roc_5               FLOAT,
    distance_sma_20     FLOAT,
    distance_sma_60     FLOAT,
    beta_qqq_60         FLOAT,
    corr_qqq_60         FLOAT,
    rel_strength_qqq_20 FLOAT,
    max_drawdown_60     FLOAT,
    updated_at          TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (ticker, date)
);

2. market_regime_state

One row per (window_days, date, model_version) representing the market-level hidden state.

CREATE TABLE IF NOT EXISTS market_regime_state (
    window_days         INT         NOT NULL,
    date                DATE        NOT NULL,
    regime_label        TEXT        NOT NULL,
    regime_id           INT         NOT NULL,
    regime_probability  FLOAT,
    model_name          TEXT        NOT NULL,
    model_version       TEXT        NOT NULL,
    transition_flag     BOOLEAN     DEFAULT FALSE,
    metadata            JSONB,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (window_days, date, model_version)
);

3. rolling_relationship_daily

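One row per (date, window_days, ticker_a, ticker_b) for filtered pairwise relationships. Store only strong or top-k links: the full matrix for roughly 100 tickers is about 4,950 pairs per date and window, which grows quickly over 20 years of daily history.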

CREATE TABLE IF NOT EXISTS rolling_relationship_daily (
    date                DATE        NOT NULL,
    window_days         INT         NOT NULL,
    ticker_a            TEXT        NOT NULL,
    ticker_b            TEXT        NOT NULL,
    corr_value          FLOAT       NOT NULL,
    beta_ratio          FLOAT,
    same_cluster        BOOLEAN,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (date, window_days, ticker_a, ticker_b)
);

4. correlation_cluster_daily

Cluster assignment for each ticker and horizon, one row per (date, window_days, ticker, cluster_method).

CREATE TABLE IF NOT EXISTS correlation_cluster_daily (
    date                DATE        NOT NULL,
    window_days         INT         NOT NULL,
    ticker              TEXT        NOT NULL,
    cluster_id          INT         NOT NULL,
    cluster_method      TEXT        NOT NULL,
    regime_label        TEXT,
    centrality_score    FLOAT,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (date, window_days, ticker, cluster_method)
);

5. outlier_event_daily

Flags unusual names relative to their cluster, regime, or historical profile.

CREATE TABLE IF NOT EXISTS outlier_event_daily (
    id                  BIGSERIAL PRIMARY KEY,
    date                DATE        NOT NULL,
    ticker              TEXT        NOT NULL,
    window_days         INT         NOT NULL,
    outlier_type        TEXT        NOT NULL,
    severity_score      FLOAT       NOT NULL,
    cluster_id          INT,
    regime_label        TEXT,
    explanation         JSONB,
    needs_news_review   BOOLEAN     DEFAULT TRUE,
    generated_at        TIMESTAMPTZ DEFAULT NOW()
);

6. scenario_forecast_daily

Stores range-of-outcomes forecasts rather than only point predictions.

CREATE TABLE IF NOT EXISTS scenario_forecast_daily (
    ticker              TEXT        NOT NULL,
    as_of_date          DATE        NOT NULL,
    horizon_days        INT         NOT NULL,
    regime_label        TEXT,
    cluster_id          INT,
    p10_return          FLOAT,
    p25_return          FLOAT,
    p50_return          FLOAT,
    p75_return          FLOAT,
    p90_return          FLOAT,
    expected_return     FLOAT,
    expected_vol        FLOAT,
    downside_prob       FLOAT,
    upside_prob         FLOAT,
    model_name          TEXT        NOT NULL,
    model_version       TEXT        NOT NULL,
    metadata            JSONB,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (ticker, as_of_date, horizon_days, model_version)
);

7. news_review_queue

This is the bridge between quant flags and qualitative research.

CREATE TABLE IF NOT EXISTS news_review_queue (
    id                  BIGSERIAL PRIMARY KEY,
    date                DATE        NOT NULL,
    ticker              TEXT        NOT NULL,
    source_event_type   TEXT        NOT NULL,
    source_event_id     BIGINT,
    status              TEXT        NOT NULL DEFAULT 'PENDING',
    analyst_notes       TEXT,
    tagged_driver       TEXT,
    reviewed_at         TIMESTAMPTZ,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);

Algorithms By Layer

Layer 1: Regime Features

Use rolling windows of:

  • 30
  • 60
  • 90
  • 180
  • 360

Feature families:

  • returns and log returns
  • realized volatility
  • drawdown
  • distance from moving averages
  • trend persistence
  • volume intensity
  • beta and correlation to QQQ
  • relative strength versus QQQ
  • breadth features aggregated at the market level

Primary implementation:

  • pandas first, possibly polars later if CPU becomes a bottleneck
  • pure functions, no side effects except DB upserts in runner scripts
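
As a rough illustration of the pure-function idea, the sketch below computes a few of the per-ticker columns from regime_feature_daily for a single close series. The helper name compute_basic_features, the synthetic prices, and the un-annualized volatility definition are assumptions for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd


def compute_basic_features(close: pd.Series) -> pd.DataFrame:
    """Return per-date features for one ticker; pure function, no DB access.

    Hypothetical helper: column names mirror regime_feature_daily, but the
    exact definitions (e.g. un-annualized volatility) are assumptions.
    """
    log_close = np.log(close)
    out = pd.DataFrame(index=close.index)
    # A k-day log return uses only prices at t and t-k (no look-ahead).
    out["log_return_1d"] = log_close.diff(1)
    out["log_return_5d"] = log_close.diff(5)
    out["log_return_20d"] = log_close.diff(20)
    # Realized volatility: rolling sample std of daily log returns.
    out["realized_vol_20d"] = out["log_return_1d"].rolling(20).std()
    # Distance from the 20-day simple moving average, as a fraction.
    sma_20 = close.rolling(20).mean()
    out["distance_sma_20"] = close / sma_20 - 1.0
    return out


# Synthetic prices for illustration only.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-02", periods=60)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 60))), index=dates)
features = compute_basic_features(close)
```

Warmup rows come back as NaN rather than being filled, so a runner script can decide explicitly whether to skip or store them.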

Layer 2: Regime Detection

Use a combination of methods, not a single algorithm.

Primary:

  • Hidden Markov Model on market-level features

Secondary:

  • Bayesian or offline change-point detection for transition dates
  • Rolling PCA / factor decomposition for market structure diagnostics

Why this combination:

  • HMM labels persistent hidden states
  • change-point detection confirms structural breaks
  • PCA shows whether leadership is broad or concentrated
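
One simple PCA diagnostic, sketched here under assumptions, is the share of variance captured by the first principal component in a window of returns. A high share suggests most names are moving on one common factor; a lower share suggests more dispersed drivers. The function name first_factor_share and the synthetic data are hypothetical.

```python
import numpy as np


def first_factor_share(returns: np.ndarray) -> float:
    """Share of total variance explained by the first principal component.

    `returns` is a (days x tickers) array for one rolling window. Working
    with the correlation matrix standardizes each column implicitly.
    """
    corr = np.corrcoef(returns, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return float(eigenvalues[-1] / eigenvalues.sum())


rng = np.random.default_rng(1)
market = rng.normal(size=(90, 1))
# One common factor dominates: most variance loads on the first component.
one_factor = market + 0.3 * rng.normal(size=(90, 8))
# Independent columns: variance is spread across components.
independent = rng.normal(size=(90, 8))

share_one_factor = first_factor_share(one_factor)
share_independent = first_factor_share(independent)
```

Tracking this share per window over time gives a single series that can sit next to the HMM labels as a structure diagnostic.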

Layer 3: Correlated Groups

Primary:

  • rolling correlation matrix
  • hierarchical clustering using correlation distance

Secondary:

  • graph-based community detection later if needed

Distance:

distance(a, b) = sqrt(0.5 * (1 - corr(a, b)))

Store only:

  • top-k strongest links per ticker
  • cluster assignment
  • cluster stability metrics
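
The distance formula and the top-k filter above could be sketched as follows. The function names, the placeholder tickers, and the choice to rank links by absolute correlation are assumptions; the roadmap does not say whether strongly negative links count as "strong".

```python
import numpy as np
import pandas as pd


def correlation_distance(corr: pd.DataFrame) -> pd.DataFrame:
    """distance(a, b) = sqrt(0.5 * (1 - corr(a, b))), which lies in [0, 1]."""
    # Clip guards against tiny negative values from floating-point error.
    return np.sqrt((0.5 * (1.0 - corr)).clip(lower=0.0))


def top_k_links(corr: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep the k strongest links (by absolute correlation) for each ticker."""
    rows = []
    for ticker in corr.index:
        peers = corr.loc[ticker].drop(ticker)
        strongest = peers.abs().nlargest(k).index
        for other in strongest:
            rows.append((ticker, other, float(peers[other])))
    return pd.DataFrame(rows, columns=["ticker_a", "ticker_b", "corr_value"])


# Synthetic returns with placeholder tickers, for illustration only.
rng = np.random.default_rng(2)
base = rng.normal(size=60)
returns = pd.DataFrame({
    "AAA": base + 0.1 * rng.normal(size=60),
    "BBB": base + 0.1 * rng.normal(size=60),
    "CCC": rng.normal(size=60),
    "DDD": -base + 0.1 * rng.normal(size=60),
})
corr = returns.corr()
dist = correlation_distance(corr)
links = top_k_links(corr, k=1)
```

The distance matrix is the natural input for a hierarchical clustering routine, and the links frame has the same key columns as rolling_relationship_daily once date and window_days are attached.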

Layer 4: Outlier Detection

Use multiple outlier types.

  1. Return outliers
    • robust z-score versus cluster peers
  2. Correlation outliers
    • ticker detaches from normal cluster neighbors
  3. Feature-space outliers
    • Mahalanobis distance
    • Isolation Forest later if needed

Primary v1 choice:

  • robust z-score + Mahalanobis distance
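
A minimal sketch of the two v1 detectors, assuming a median/MAD definition of the robust z-score and a sample-covariance Mahalanobis distance. Function names and the example numbers are hypothetical.

```python
import numpy as np


def robust_zscores(values: np.ndarray) -> np.ndarray:
    """Z-scores based on median and MAD instead of mean and std."""
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        return np.zeros_like(values, dtype=float)
    # 1.4826 makes MAD comparable to a standard deviation under normality.
    return (values - median) / (1.4826 * mad)


def mahalanobis_distances(features: np.ndarray) -> np.ndarray:
    """Distance of each row from the sample mean, scaled by the covariance."""
    centered = features - features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates near-singular cov
    squared = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(squared, 0.0))


# One-day returns for six names in a cluster; the last one has detached.
cluster_returns = np.array([0.010, 0.012, 0.008, 0.011, 0.009, 0.080])
z = robust_zscores(cluster_returns)

# Feature-space example: 50 synthetic rows, with row 0 pushed far away.
rng = np.random.default_rng(3)
feature_rows = rng.normal(size=(50, 3))
feature_rows[0] = [6.0, -6.0, 6.0]
distances = mahalanobis_distances(feature_rows)
```

Median and MAD are used because a single extreme name would otherwise inflate the mean and standard deviation and partly hide itself.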

Layer 5: Outcome Modeling

This layer should start simple before going deep.

Baseline models first:

  • conditional historical forward-return tables
  • quantile regression
  • random forest / gradient boosting baseline

GPU-enhanced models second:

  • PyTorch MLP or sequence model for multi-horizon quantile prediction
  • optional temporal model once the baseline is proven useful

The main objective is not a single directional label. It is a conditional range:

  • expected return
  • tail loss
  • upside potential
  • regime-conditioned uncertainty
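
The simplest baseline, a conditional historical forward-return table, can be sketched as a group-by over regime labels. The function names, the "calm" and "stressed" labels, and the synthetic prices below are placeholders rather than outputs of the real pipeline.

```python
import numpy as np
import pandas as pd


def forward_log_return(close: pd.Series, horizon: int) -> pd.Series:
    """Log return from t to t+horizon; NaN for the last `horizon` rows."""
    log_close = np.log(close)
    return log_close.shift(-horizon) - log_close


def conditional_quantiles(frame: pd.DataFrame, quantiles=(0.1, 0.5, 0.9)) -> pd.DataFrame:
    """Empirical forward-return quantiles per regime label."""
    valid = frame.dropna(subset=["fwd_return"])
    table = valid.groupby("regime_label")["fwd_return"].quantile(list(quantiles))
    return table.unstack()  # rows: regime label, columns: quantile level


# Synthetic prices and placeholder regime labels, for illustration only.
rng = np.random.default_rng(4)
dates = pd.bdate_range("2023-01-02", periods=300)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))), index=dates)
frame = pd.DataFrame({
    "regime_label": np.where(np.arange(300) % 100 < 60, "calm", "stressed"),
    "fwd_return": forward_log_return(close, horizon=5),
})
table = conditional_quantiles(frame)
```

Forward returns are targets only: they look ahead by construction, so they belong in the label column of a training set and never among the features.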

Layer 6: News Cross-Check

Do not let news drive the first-stage signal.

Instead:

  • detect anomaly quantitatively
  • enqueue it for review
  • attach cause tags later
  • eventually train on those tags as structured metadata

This keeps the quantitative system honest and explainable.
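
The detect-then-enqueue step might look like the sketch below. The ReviewItem class mirrors a subset of news_review_queue columns; the min_severity threshold and the choice to copy outlier_type into source_event_type are assumptions, since the roadmap does not fix either.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    """Mirrors a subset of news_review_queue columns."""
    date: dt.date
    ticker: str
    source_event_type: str
    source_event_id: int
    status: str = "PENDING"
    tagged_driver: Optional[str] = None  # filled in later by a reviewer


def enqueue_outliers(events, min_severity):
    """Turn outlier events (dicts) into pending review items.

    `min_severity` is a hypothetical threshold, not a value from the roadmap.
    """
    return [
        ReviewItem(e["date"], e["ticker"], e["outlier_type"], e["id"])
        for e in events
        if e["needs_news_review"] and e["severity_score"] >= min_severity
    ]


# Placeholder events shaped like outlier_event_daily rows.
events = [
    {"id": 1, "date": dt.date(2026, 1, 30), "ticker": "AAA",
     "outlier_type": "return", "severity_score": 4.2, "needs_news_review": True},
    {"id": 2, "date": dt.date(2026, 1, 30), "ticker": "BBB",
     "outlier_type": "correlation", "severity_score": 1.1, "needs_news_review": True},
]
queue = enqueue_outliers(events, min_severity=3.0)
```

Nothing in this step reads news: the queue is built purely from quantitative flags, and cause tags are attached afterwards.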


Day-By-Day Plan

Day 7 — Architecture and Schema Freeze

Goal

Lock the project shape before code spreads in too many directions.

Deliverables

  • create analytics/regime/ and model/outcomes/ module skeletons
  • write schema files for analytics and outcomes tables
  • update docker-compose.yml with new profiles and service stubs
  • write a short runbook for the new pipeline

What to implement

  • analytics/regime/schema.sql
  • model/outcomes/schema.sql
  • base Dockerfiles and pinned requirements
  • shared DB helpers

Learning goal

Understand why raw data, derived features, hidden states, outlier events, and forecast outputs should live in different tables.


Day 8 — Feature Store and Data QA

Goal

Build a trustworthy feature layer from market_data_daily.

Deliverables

  • features.py
  • market_features.py
  • run_features.py
  • validation queries and row-count checks

Algorithms

  • rolling returns
  • volatility
  • ATR
  • RSI
  • MACD histogram
  • Bollinger %B
  • beta/correlation to QQQ
  • relative strength

Validation

  • compare features for 3 known tickers manually
  • ensure no future leakage
  • confirm warmup periods are handled correctly
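
One way to test for future leakage, sketched under assumptions, is a truncation check: a feature value at date t must not change when data after t is removed. The helper names are hypothetical, and leaky_vol is deliberately wrong to show what the check catches.

```python
import numpy as np
import pandas as pd


def realized_vol(close: pd.Series, window: int = 20) -> pd.Series:
    """Backward-looking rolling volatility of daily log returns."""
    return np.log(close).diff().rolling(window).std()


def leaky_vol(close: pd.Series, window: int = 20) -> pd.Series:
    """Deliberately wrong: a centered window peeks at future prices."""
    return np.log(close).diff().rolling(window, center=True).std()


def is_causal(feature_fn, close: pd.Series, cutoff: int) -> bool:
    """True if values up to `cutoff` are unchanged when later data is removed."""
    full = feature_fn(close).iloc[:cutoff]
    truncated = feature_fn(close.iloc[:cutoff])
    return bool(np.allclose(full, truncated, equal_nan=True))


# Synthetic prices for illustration only.
rng = np.random.default_rng(5)
dates = pd.bdate_range("2024-01-02", periods=120)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 120))), index=dates)

causal_ok = is_causal(realized_vol, close, cutoff=80)
leaky_ok = is_causal(leaky_vol, close, cutoff=80)
warmup_rows = int(realized_vol(close).isna().sum())
```

The same check doubles as a warmup test: the backward-looking feature leaves its first 20 rows as NaN instead of emitting values from a partial window.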

Learning goal

See how almost all later models depend on feature correctness more than model complexity.


Day 9 — Regime Detection Engine

Goal

Label market states over 30/60/90/180/360-day windows.

Deliverables

  • regime_hmm.py
  • change_points.py
  • pca_factors.py
  • run_regimes.py

Algorithms

  • Gaussian HMM on market-level features
  • change-point detection on QQQ and breadth features
  • rolling PCA on standardized cross-sectional returns

Validation

  • inspect regime labels during 2008, 2020, 2022, 2023, 2024, 2025
  • ensure state counts are not degenerate
  • verify transition flags cluster near real structural breaks
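
Degenerate or flickering states are easy to spot from run lengths. The sketch below collapses a label sequence into consecutive runs and derives transition flags; the "bull" and "bear" labels are placeholders, not labels the roadmap prescribes.

```python
from itertools import groupby


def regime_run_lengths(labels):
    """Collapse a label sequence into (label, consecutive_days) runs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]


def transition_flags(labels):
    """True on each day whose label differs from the previous day."""
    return [i > 0 and labels[i] != labels[i - 1] for i in range(len(labels))]


labels = ["bull"] * 5 + ["bear"] * 3 + ["bull"] * 2
runs = regime_run_lengths(labels)   # [("bull", 5), ("bear", 3), ("bull", 2)]
flags = transition_flags(labels)    # True at positions 5 and 8 only
shortest_run = min(length for _, length in runs)
```

A very short shortest run, or one label covering nearly every day, is a hint that the state count or the input features need another look.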

Learning goal

Learn the difference between persistent hidden states and abrupt structural breaks.


Day 10 — Correlation Clusters and Market Topology

Goal

Find groups of stocks that move together inside each regime.

Deliverables

  • correlations.py
  • clusters.py
  • run_clusters.py

Algorithms

  • rolling correlation matrices
  • hierarchical clustering
  • cluster stability across windows
  • simple network statistics such as degree and centrality
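
Cluster stability across windows can be measured without matching cluster IDs, because IDs are arbitrary between runs. The sketch below uses the plain (unadjusted) Rand index as one possible stability metric; that choice, the function name, and the placeholder tickers are assumptions.

```python
from itertools import combinations


def pair_agreement(clusters_a: dict, clusters_b: dict) -> float:
    """Share of ticker pairs on which two clusterings agree (Rand index)."""
    tickers = sorted(set(clusters_a) & set(clusters_b))
    pairs = list(combinations(tickers, 2))
    agree = sum(
        (clusters_a[x] == clusters_a[y]) == (clusters_b[x] == clusters_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


# Placeholder assignments for two windows; cluster IDs are arbitrary.
window_60 = {"AAA": 0, "BBB": 0, "CCC": 1, "DDD": 1}
window_180 = {"AAA": 5, "BBB": 5, "CCC": 7, "DDD": 5}
relabeled = {"AAA": 9, "BBB": 9, "CCC": 3, "DDD": 3}

stability = pair_agreement(window_60, window_180)   # 0.5
same_groups = pair_agreement(window_60, relabeled)  # 1.0
```

A pair counts as agreement when both clusterings put the two tickers together or both keep them apart, so relabeling clusters does not change the score.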

Validation

  • confirm semis, software, mega-cap, consumer, and biotech names form sensible groups
  • compare cluster maps across bullish and defensive regimes

Learning goal

Understand that regime shifts often change the market graph before they show up in simple return summaries.


Day 11 — Outlier Engine and Review Queue

Goal

Detect names that detach from their peers and deserve attention.

Deliverables

  • outliers.py
  • run_outliers.py
  • generation of review entries in news_review_queue for flagged outliers

Algorithms

  • robust z-score versus cluster
  • Mahalanobis distance in feature space
  • correlation breakdown detector

Validation

  • backtest known event dates around earnings, AI rallies, shocks, and guidance changes
  • verify that flagged outliers are interpretable, not just noise

Learning goal

See that abnormal behavior is often more actionable than average behavior.


Day 12 — Scenario Models, Baselines First

Goal

Predict ranges of outcomes instead of only point direction.

Deliverables

  • targets.py
  • dataset.py
  • baseline_models.py
  • quantile_model.py
  • train.py

Algorithms

  • conditional forward return distributions
  • quantile regression for 5, 20, 60-day horizons
  • random forest or gradient boosting baseline

Validation

  • compare quantile calibration
  • check whether p10/p50/p90 bands are sensible during different regimes
  • benchmark against naive baselines
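
Quantile calibration can be checked with two small functions: empirical coverage, which should be close to the nominal level, and pinball loss, which scores a predicted quantile against realized returns. This is a sketch on synthetic normal returns; the function names and numbers are illustrative only.

```python
import numpy as np


def empirical_coverage(realized: np.ndarray, predicted_quantile: np.ndarray) -> float:
    """Fraction of realized returns at or below the predicted quantile."""
    return float(np.mean(realized <= predicted_quantile))


def pinball_loss(realized: np.ndarray, predicted_quantile: np.ndarray, q: float) -> float:
    """Average quantile (pinball) loss; lower is better for level q."""
    error = realized - predicted_quantile
    return float(np.mean(np.maximum(q * error, (q - 1.0) * error)))


# Synthetic realized returns drawn from N(0, 0.02), for illustration only.
rng = np.random.default_rng(6)
sigma = 0.02
realized = rng.normal(0.0, sigma, 5000)

# A calibrated p10 for this distribution versus a naive p10 stuck at zero.
p10_calibrated = np.full(realized.shape, -1.2816 * sigma)
p10_naive = np.zeros_like(realized)

coverage_calibrated = empirical_coverage(realized, p10_calibrated)
coverage_naive = empirical_coverage(realized, p10_naive)
loss_calibrated = pinball_loss(realized, p10_calibrated, q=0.1)
loss_naive = pinball_loss(realized, p10_naive, q=0.1)
```

Running the same two checks separately per regime label shows whether the bands stay sensible when conditions change.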

Learning goal

Learn why uncertainty bands are usually more useful than raw directional calls.


Day 13 — GPU Models and Universe Training

Goal

Use the DGX deliberately where it adds value.

Deliverables

  • GPU-enabled scenario training
  • train_universe.py
  • model artifact persistence to model_weights
  • predict.py and predict_universe.py

Algorithms

  • PyTorch MLP or temporal model for multi-horizon quantiles
  • batch inference by ticker universe

Validation

  • compare GPU model against baselines
  • log calibration and error by regime
  • do not keep the GPU model if it does not beat the simpler baselines

Learning goal

Learn when deep models actually add value and when they only add complexity.


Day 14 — Reporting, Blog Output, and Operating Rhythm

Goal

Turn the analytics into a repeatable daily research loop.

Deliverables

  • summarize.py
  • reporting container
  • daily markdown intelligence report
  • dashboard-ready SQL queries

Daily report should answer

  • current regime by window
  • strongest clusters
  • top positive and negative outliers
  • scenario ranges for key names and groups
  • queue items for news review

Learning goal

Build the habit of reading model outputs critically instead of treating them as oracles.


Phase 2 After Day 14

Only after the core pipeline is stable:

  • add market-news ingestion and event tagging
  • add embeddings or topic clustering for news explanations
  • add graph neural or sequence models only if the simpler system proves useful
  • optionally experiment with CUDA-accelerated feature kernels as a learning track

Docker Compose Design

Proposed New Services

regime-features:
  build: ./analytics/regime
  command: ["python", "run_features.py"]
  profiles: ["regime"]

regime-detect:
  build: ./analytics/regime
  command: ["python", "run_regimes.py"]
  profiles: ["regime"]

regime-cluster:
  build: ./analytics/regime
  command: ["python", "run_clusters.py"]
  profiles: ["regime"]

regime-outliers:
  build: ./analytics/regime
  command: ["python", "run_outliers.py"]
  profiles: ["regime"]

scenario-train:
  build: ./model/outcomes
  command: ["python", "train_universe.py"]
  runtime: nvidia
  profiles: ["outcomes"]

scenario-predict:
  build: ./model/outcomes
  command: ["python", "predict_universe.py"]
  runtime: nvidia
  profiles: ["outcomes"]

report-daily:
  build: ./reporting/market_intelligence
  command: ["python", "compose_summary.py"]
  profiles: ["report"]

Shared Volumes

  • postgres_data
  • model_weights
  • optional report_artifacts

Runtime Plan

  1. ingest profile updates raw prices
  2. regime profile recomputes features, states, clusters, outliers
  3. outcomes profile trains or predicts scenario bands
  4. report profile writes a human-readable summary

Validation Framework

The project should be considered healthy only if it passes these checks.

Data Quality

  • no missing dates beyond expected market holidays
  • warmup rows handled consistently
  • no duplicate keys

Statistical Quality

  • regime labels persist long enough to be meaningful
  • clusters are stable enough to interpret
  • outlier flags are sparse and explainable
  • scenario ranges are calibrated

Operational Quality

  • all jobs are Dockerized
  • CPU-heavy jobs parallelize safely
  • GPU jobs are isolated and reproducible
  • artifacts are versioned and reusable

Should We Use CUDA / C++?

Not for the first implementation.

For this phase:

  • use Python
  • use pandas, numpy, scikit-learn, hmmlearn or equivalent
  • use PyTorch on the DGX only where it helps

Why:

  • feature engineering correctness matters more than low-level speed right now
  • the first bottleneck is research design, not kernel performance
  • CUDA/C++ is better as a focused learning project after the baseline pipeline is proven

Good future CUDA learning targets:

  • rolling-window indicator kernels
  • large-scale Monte Carlo simulation
  • pairwise correlation acceleration

Definition of Success

This phase succeeds when dgx-trading-system can reliably produce:

  • multi-horizon regime labels
  • regime-aware correlation groups
  • meaningful outlier events
  • range-of-outcomes forecasts
  • daily summaries that point you toward what deserves news review

If the system does that, it will help with both:

  • market understanding
  • model understanding

That combination is the right foundation before building more aggressive ML.