Day 7 Onward — Market Regime Intelligence Roadmap

Mission

Build a new analytics project inside dgx-trading-system that turns 20 years of Nasdaq-100 daily OHLCV history into:

market regime labels across multiple horizons
groups of correlated stocks that behave similarly inside each regime
outlier detection for names that detach from their group
range-of-outcomes forecasts for stocks and clusters
a workflow for cross-checking statistical anomalies against market news

This roadmap starts at Day 7 and continues forward from the ingestion work already completed in ingestion/.

The guiding goal is not only to produce forecasts, but also to build a system that helps explain:

what regime the market is in
which stocks are moving together
which names are unusually strong or weak
what outcomes historically followed similar conditions

Why This Project Shape

The system should separate concerns cleanly:

ingestion/ loads raw market data and stays unchanged
analytics/regime/ creates market features, regime labels, clusters, and outlier events
model/outcomes/ trains forecasting and scenario models
reporting/ produces summaries, blog-ready outputs, and later LLM/news context

This makes the pipeline modular, easier to test, and easier to reason about.

Target Questions

By the end of this phase, the system should answer:

What market regime are we currently in over 30/60/90/180/360-day windows?
Which stocks are tightly linked in the current regime?
Which stocks are behaving abnormally relative to their peers?
What forward return ranges historically followed similar setups?
Which flagged outliers deserve manual or automated news review?

Recommended System Design

Pipeline

market_data_daily
      |
      v
regime feature engineering (CPU, parallel across tickers / windows)
      |
      +--> market-level regime detection (CPU)
      |
      +--> rolling correlation + clustering (CPU)
      |
      +--> outlier detection (CPU)
      |
      v
scenario feature store
      |
      +--> baseline forecasting models (CPU)
      |
      +--> deep / GPU scenario models (NVIDIA)
      |
      v
scenario forecasts + daily intelligence summary
      |
      v
news cross-check / analyst review

Container Roles

ingestion: CPU, existing, unchanged
regime-features: CPU, parallel feature computation and DB upserts
regime-detect: CPU, HMM / change-point / PCA jobs
regime-cluster: CPU, rolling correlations and clustering jobs
regime-outliers: CPU, robust outlier detection and event generation
scenario-train: GPU-enabled, trains probabilistic outcome models
scenario-predict: GPU-enabled, writes forward scenario ranges
report-daily: CPU, composes summaries for review and later LLM/news layers

Parallelism Strategy

Parallelize per-ticker and per-window feature generation on CPU
Parallelize rolling-correlation jobs by window on CPU
Keep heavy GPU training controlled and sequential per model family
Allow batched GPU inference once artifacts are stable

This gives us real parallel throughput without fighting for the GPU.

Proposed New Project Layout

dgx-trading-system/
  analytics/
    regime/
      Dockerfile
      requirements.txt
      tickers.py
      db.py
      windows.py
      features.py
      market_features.py
      regime_hmm.py
      change_points.py
      pca_factors.py
      correlations.py
      clusters.py
      outliers.py
      run_features.py
      run_regimes.py
      run_clusters.py
      run_outliers.py
      summarize.py
      schema.sql
  model/
    outcomes/
      Dockerfile
      requirements.txt
      dataset.py
      targets.py
      baseline_models.py
      quantile_model.py
      montecarlo.py
      train.py
      train_universe.py
      predict.py
      predict_universe.py
      schema.sql
  reporting/
    market_intelligence/
      Dockerfile
      requirements.txt
      compose_summary.py
      render_markdown.py
  docs/
    day-07-regime-intelligence-roadmap.md
    market-regime-intelligence-blog.md

Database Design

These tables keep raw data, derived analytics, forecasts, and review metadata separate. That separation matters both for debugging and for learning how each model behaves.

1. `regime_feature_daily`

One row per (ticker, date), holding row-level engineered features.

CREATE TABLE IF NOT EXISTS regime_feature_daily (
    ticker              TEXT        NOT NULL,
    date                DATE        NOT NULL,
    log_return_1d       FLOAT,
    log_return_5d       FLOAT,
    log_return_20d      FLOAT,
    realized_vol_20d    FLOAT,
    realized_vol_60d    FLOAT,
    atr_14              FLOAT,
    rsi_14              FLOAT,
    macd_hist           FLOAT,
    bb_pct_b            FLOAT,
    volume_ratio_20d    FLOAT,
    hl_range            FLOAT,
    close_to_open       FLOAT,
    roc_5               FLOAT,
    distance_sma_20     FLOAT,
    distance_sma_60     FLOAT,
    beta_qqq_60         FLOAT,
    corr_qqq_60         FLOAT,
    rel_strength_qqq_20 FLOAT,
    max_drawdown_60     FLOAT,
    updated_at          TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (ticker, date)
);

2. `market_regime_state`

One row per (window_days, date, model_version) representing the market-level hidden state.

CREATE TABLE IF NOT EXISTS market_regime_state (
    window_days         INT         NOT NULL,
    date                DATE        NOT NULL,
    regime_label        TEXT        NOT NULL,
    regime_id           INT         NOT NULL,
    regime_probability  FLOAT,
    model_name          TEXT        NOT NULL,
    model_version       TEXT        NOT NULL,
    transition_flag     BOOLEAN     DEFAULT FALSE,
    metadata            JSONB,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (window_days, date, model_version)
);

3. `rolling_relationship_daily`

One row per (date, window_days, ticker_a, ticker_b) for filtered pairwise relationships. Only store strong or top-k links to avoid table explosion.

CREATE TABLE IF NOT EXISTS rolling_relationship_daily (
    date                DATE        NOT NULL,
    window_days         INT         NOT NULL,
    ticker_a            TEXT        NOT NULL,
    ticker_b            TEXT        NOT NULL,
    corr_value          FLOAT       NOT NULL,
    beta_ratio          FLOAT,
    same_cluster        BOOLEAN,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (date, window_days, ticker_a, ticker_b)
);

4. `correlation_cluster_daily`

One row per (date, window_days, ticker, cluster_method), holding the cluster assignment for each ticker at each window.

CREATE TABLE IF NOT EXISTS correlation_cluster_daily (
    date                DATE        NOT NULL,
    window_days         INT         NOT NULL,
    ticker              TEXT        NOT NULL,
    cluster_id          INT         NOT NULL,
    cluster_method      TEXT        NOT NULL,
    regime_label        TEXT,
    centrality_score    FLOAT,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (date, window_days, ticker, cluster_method)
);

5. `outlier_event_daily`

Flags unusual names relative to their cluster, regime, or historical profile.

CREATE TABLE IF NOT EXISTS outlier_event_daily (
    id                  BIGSERIAL PRIMARY KEY,
    date                DATE        NOT NULL,
    ticker              TEXT        NOT NULL,
    window_days         INT         NOT NULL,
    outlier_type        TEXT        NOT NULL,
    severity_score      FLOAT       NOT NULL,
    cluster_id          INT,
    regime_label        TEXT,
    explanation         JSONB,
    needs_news_review   BOOLEAN     DEFAULT TRUE,
    generated_at        TIMESTAMPTZ DEFAULT NOW()
);

6. `scenario_forecast_daily`

Stores range-of-outcomes forecasts rather than only point predictions.

CREATE TABLE IF NOT EXISTS scenario_forecast_daily (
    ticker              TEXT        NOT NULL,
    as_of_date          DATE        NOT NULL,
    horizon_days        INT         NOT NULL,
    regime_label        TEXT,
    cluster_id          INT,
    p10_return          FLOAT,
    p25_return          FLOAT,
    p50_return          FLOAT,
    p75_return          FLOAT,
    p90_return          FLOAT,
    expected_return     FLOAT,
    expected_vol        FLOAT,
    downside_prob       FLOAT,
    upside_prob         FLOAT,
    model_name          TEXT        NOT NULL,
    model_version       TEXT        NOT NULL,
    metadata            JSONB,
    generated_at        TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (ticker, as_of_date, horizon_days, model_version)
);

7. `news_review_queue`

This is the bridge between quant flags and qualitative research.

CREATE TABLE IF NOT EXISTS news_review_queue (
    id                  BIGSERIAL PRIMARY KEY,
    date                DATE        NOT NULL,
    ticker              TEXT        NOT NULL,
    source_event_type   TEXT        NOT NULL,
    source_event_id     BIGINT,
    status              TEXT        NOT NULL DEFAULT 'PENDING',
    analyst_notes       TEXT,
    tagged_driver       TEXT,
    reviewed_at         TIMESTAMPTZ,
    created_at          TIMESTAMPTZ DEFAULT NOW()
);

Algorithms By Layer

Layer 1: Regime Features

Compute these feature families over rolling windows:

returns and log returns
realized volatility
drawdown
distance from moving averages
trend persistence
volume intensity
beta and correlation to QQQ
relative strength versus QQQ
breadth features aggregated at the market level

Primary implementation:

pandas first, possibly polars later if CPU becomes a bottleneck
pure functions, no side effects except DB upserts in runner scripts
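As a rough illustration of the "pure functions" idea, the sketch below computes three of the listed feature families from a close-price array. It uses numpy only to stay dependency-light; the function names, the 252-day annualization factor, and the NaN warmup convention are assumptions for illustration, not the project's settled API.

```python
import numpy as np


def log_returns(close, lag=1):
    """Log return over `lag` days; the first `lag` rows are NaN (warmup)."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    out[lag:] = np.log(close[lag:] / close[:-lag])
    return out


def realized_vol(returns, window=20, periods_per_year=252):
    """Annualized rolling std of daily log returns, trailing window only."""
    returns = np.asarray(returns, dtype=float)
    out = np.full(returns.shape, np.nan)
    for t in range(window - 1, len(returns)):
        chunk = returns[t - window + 1 : t + 1]
        # Leave the row as NaN until the whole window is populated.
        if not np.isnan(chunk).any():
            out[t] = chunk.std(ddof=1) * np.sqrt(periods_per_year)
    return out


def max_drawdown(close, window=60):
    """Worst peak-to-trough decline inside each trailing window (value <= 0)."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    for t in range(window - 1, len(close)):
        chunk = close[t - window + 1 : t + 1]
        running_peak = np.maximum.accumulate(chunk)
        out[t] = (chunk / running_peak - 1.0).min()
    return out
```

Each function takes arrays and returns arrays, so a runner script such as run_features.py could own the DB upserts while these stay easy to unit test.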

Layer 2: Regime Detection

Use a combination of methods, not a single algorithm.

Primary:

Hidden Markov Model on market-level features

Secondary:

Bayesian or offline change-point detection for transition dates
Rolling PCA / factor decomposition for market structure diagnostics

Why this combination:

HMM labels persistent hidden states
change-point detection confirms structural breaks
PCA shows whether leadership is broad or concentrated
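Whatever model produces the state path, two cheap diagnostics help judge it: where the state changes, and how long each state persists. The sketch below assumes a decoded path of integer regime ids is already available (for example from an HMM); the function names and the thresholds in `looks_degenerate` are hypothetical placeholders to tune, not recommended values.

```python
from itertools import groupby
from statistics import median


def transition_flags(states):
    """True on each date whose state differs from the previous date's state."""
    return [i > 0 and states[i] != states[i - 1] for i in range(len(states))]


def run_lengths(states):
    """Collapse a state path into (state, consecutive_days) runs."""
    return [(state, sum(1 for _ in group)) for state, group in groupby(states)]


def looks_degenerate(states, n_states, min_share=0.05, min_median_run=5):
    """Flag paths where a state is almost never used or states flicker daily.

    The thresholds are illustrative assumptions only.
    """
    shares = [sum(1 for s in states if s == k) / len(states) for k in range(n_states)]
    runs = [length for _, length in run_lengths(states)]
    return min(shares) < min_share or median(runs) < min_median_run
```

The flags map naturally onto the `transition_flag` column in market_regime_state, and the run-length check is one way to test that labels persist long enough to be meaningful.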

Layer 3: Correlated Groups

Primary:

rolling correlation matrix
hierarchical clustering using correlation distance

Secondary:

graph-based community detection later if needed

Distance:

distance(a, b) = sqrt(0.5 * (1 - corr(a, b)))

Store only:

top-k strongest links per ticker
cluster assignment
cluster stability metrics
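The distance formula and the top-k storage rule can be sketched as follows. This is a minimal numpy version under stated assumptions: `returns` is a (days, tickers) array for one window, link strength is measured by absolute correlation (the roadmap does not say whether strong negative links count), and both function names are hypothetical.

```python
import numpy as np


def correlation_distance(returns):
    """Map a (n_days, n_tickers) return matrix to pairwise distances in [0, 1].

    Implements distance(a, b) = sqrt(0.5 * (1 - corr(a, b))).
    """
    corr = np.corrcoef(np.asarray(returns, dtype=float), rowvar=False)
    corr = np.clip(corr, -1.0, 1.0)  # guard against floating-point overshoot
    dist = np.sqrt(0.5 * (1.0 - corr))
    np.fill_diagonal(dist, 0.0)
    return dist


def top_k_links(corr, tickers, k=3):
    """Keep, for each ticker, its k peers with the largest absolute correlation.

    Returns sorted (ticker_a, ticker_b, corr_value) tuples with ticker_a < ticker_b,
    so each pair is stored once.
    """
    corr = np.asarray(corr, dtype=float)
    links = {}
    for i, name in enumerate(tickers):
        strength = np.abs(corr[i]).copy()
        strength[i] = -np.inf  # never link a ticker to itself
        for j in np.argsort(strength)[::-1][:k]:
            a, b = sorted((name, tickers[int(j)]))
            links[(a, b)] = float(corr[i, int(j)])
    return sorted((a, b, value) for (a, b), value in links.items())
```

With roughly 100 tickers, top-k filtering keeps at most 100 * k rows per date and window instead of about 4,950 unordered pairs, which is the table-explosion concern noted for rolling_relationship_daily.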

Layer 4: Outlier Detection

Use multiple outlier types.

Return outliers
- robust z-score versus cluster peers
Correlation outliers
- ticker detaches from normal cluster neighbors
Feature-space outliers
- Mahalanobis distance
- Isolation Forest later if needed

Primary v1 choice:

robust z-score + Mahalanobis distance
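A minimal sketch of the two v1 detectors, assuming inputs are already restricted to one cluster on one date. The robust z-score here uses the median and MAD (one common choice; the roadmap does not specify the estimator), and the Mahalanobis version uses the plain sample mean and covariance rather than a robust estimate. Function names are hypothetical.

```python
import numpy as np


def robust_zscores(values):
    """Median/MAD z-scores; 1.4826 rescales MAD to match std under normality."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        # All peers (or most of them) are identical; no meaningful scale.
        return np.zeros_like(values)
    return (values - med) / (1.4826 * mad)


def mahalanobis_distances(features):
    """Distance of each row from the sample mean, scaled by sample covariance."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates collinear features
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(d2, 0.0))
```

Because the outlier itself contributes to the sample mean and covariance, very extreme names partly mask themselves; a robust covariance estimate is a possible later refinement.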

Layer 5: Outcome Modeling

This layer should start simple before going deep.

Baseline models first:

conditional historical forward-return tables
quantile regression
random forest / gradient boosting baseline

GPU-enhanced models second:

PyTorch MLP or sequence model for multi-horizon quantile prediction
optional temporal model once the baseline is proven useful

The main objective is not a single directional label. It is a conditional range:

expected return
tail loss
upside potential
regime-conditioned uncertainty
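The simplest baseline, a conditional historical forward-return table, might look like the sketch below. It assumes one close series and one regime label per date; the output keys mirror the p10 to p90 and `downside_prob` columns of scenario_forecast_daily, while the function names and the `n_obs` field are illustrative additions.

```python
import numpy as np


def forward_returns(close, horizon):
    """Log return from t to t + horizon; the last `horizon` rows stay NaN."""
    close = np.asarray(close, dtype=float)
    out = np.full(close.shape, np.nan)
    out[:-horizon] = np.log(close[horizon:] / close[:-horizon])
    return out


def conditional_quantiles(fwd, regimes, quantiles=(0.10, 0.25, 0.50, 0.75, 0.90)):
    """Per regime label: empirical quantiles of forward returns and downside odds."""
    fwd = np.asarray(fwd, dtype=float)
    regimes = np.asarray(regimes)
    table = {}
    for label in np.unique(regimes):
        sample = fwd[(regimes == label) & ~np.isnan(fwd)]
        if sample.size == 0:
            continue  # no completed forward window for this regime yet
        row = {f"p{int(round(q * 100))}_return": float(np.quantile(sample, q)) for q in quantiles}
        row["downside_prob"] = float((sample < 0).mean())
        row["n_obs"] = int(sample.size)
        table[str(label)] = row
    return table
```

Tracking `n_obs` alongside the quantiles matters because a regime with a handful of observations produces bands that look precise but are not.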

Layer 6: News Cross-Check

Do not let news drive the first-stage signal.

Instead:

detect anomaly quantitatively
enqueue it for review
attach cause tags later
eventually train on those tags as structured metadata

This keeps the quantitative system honest and explainable.
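The "detect, then enqueue" step can be sketched in a few lines. The dataclass below mirrors a subset of the outlier_event_daily columns and the output dicts mirror news_review_queue; the severity cutoff and the per-day cap are hypothetical knobs, and the function name is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OutlierEvent:
    id: int
    date: date
    ticker: str
    outlier_type: str
    severity_score: float
    needs_news_review: bool = True


def build_review_queue(events, min_severity=3.5, max_items=10):
    """Turn flagged outlier events into PENDING queue rows, most severe first."""
    eligible = [e for e in events if e.needs_news_review and e.severity_score >= min_severity]
    eligible.sort(key=lambda e: e.severity_score, reverse=True)
    return [
        {
            "date": e.date,
            "ticker": e.ticker,
            "source_event_type": e.outlier_type,
            "source_event_id": e.id,
            "status": "PENDING",
        }
        for e in eligible[:max_items]
    ]
```

Note that no news input appears anywhere in this function: the queue is built purely from quantitative flags, and cause tags are attached afterwards.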

Day-By-Day Plan

Day 7 — Architecture and Schema Freeze

Goal

Lock the project shape before code spreads in too many directions.

Deliverables

create analytics/regime/ and model/outcomes/ module skeletons
write schema files for analytics and outcomes tables
update docker-compose.yml with new profiles and service stubs
write a short runbook for the new pipeline

What to implement

analytics/regime/schema.sql
model/outcomes/schema.sql
base Dockerfiles and pinned requirements
shared DB helpers

Learning goal

Understand why raw data, derived features, hidden states, outlier events, and forecast outputs should live in different tables.

Day 8 — Feature Store and Data QA

Goal

Build a trustworthy feature layer from market_data_daily.

Deliverables

features.py
market_features.py
run_features.py
validation queries and row-count checks

Algorithms

rolling returns
volatility
ATR
RSI
MACD histogram
Bollinger %B
beta/correlation to QQQ
relative strength

Validation

compare features for 3 known tickers manually
ensure no future leakage
confirm warmup periods are handled correctly
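One way to test for future leakage is a truncation check: a causal feature at date t must not change when rows after t are removed. The sketch below shows the check against a trailing feature and against a deliberately leaky one (full-sample demeaning). All names are hypothetical, and the example features stand in for whatever features.py ends up containing.

```python
import numpy as np


def daily_log_returns(close):
    close = np.asarray(close, dtype=float)
    r = np.full(close.shape, np.nan)
    r[1:] = np.log(close[1:] / close[:-1])
    return r


def trailing_mean_return(close, window=5):
    """Causal: mean of the last `window` daily returns; warmup rows are NaN."""
    r = daily_log_returns(close)
    out = np.full(r.shape, np.nan)
    for t in range(window, len(r)):
        out[t] = r[t - window + 1 : t + 1].mean()
    return out


def demeaned_return_leaky(close):
    """Deliberately wrong: subtracting the full-sample mean uses future rows."""
    r = daily_log_returns(close)
    return r - np.nanmean(r)


def has_future_leakage(feature_fn, close, cutoffs):
    """Recompute the feature on data truncated at each cutoff and compare."""
    close = np.asarray(close, dtype=float)
    full = feature_fn(close)
    for t in cutoffs:
        truncated = feature_fn(close[: t + 1])
        if not np.allclose(full[: t + 1], truncated, equal_nan=True):
            return True
    return False
```

The same comparison also exercises warmup handling, since `equal_nan=True` requires the NaN rows to line up between the full and truncated runs.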

Learning goal

See how almost every later model depends more on feature correctness than on model complexity.

Day 9 — Regime Detection Engine

Goal

Label market states over 30/60/90/180/360-day windows.

Deliverables

regime_hmm.py
change_points.py
pca_factors.py
run_regimes.py

Algorithms

Gaussian HMM on market-level features
change-point detection on QQQ and breadth features
rolling PCA on standardized cross-sectional returns

Validation

inspect regime labels during 2008, 2020, 2022, 2023, 2024, 2025
ensure state counts are not degenerate
verify transition flags cluster near real structural breaks

Learning goal

Learn the difference between persistent hidden states and abrupt structural breaks.

Day 10 — Correlation Clusters and Market Topology

Goal

Find groups of stocks that move together inside each regime.

Deliverables

correlations.py
clusters.py
run_clusters.py

Algorithms

rolling correlation matrices
hierarchical clustering
cluster stability across windows
simple network statistics such as degree and centrality

Validation

confirm semis, software, mega-cap, consumer, and biotech names form sensible groups
compare cluster maps across bullish and defensive regimes

Learning goal

Understand that regime shifts often change the market graph before they show up in simple return summaries.

Day 11 — Outlier Engine and Review Queue

Goal

Detect names that detach from their peers and deserve attention.

Deliverables

outliers.py
run_outliers.py
enqueueing of flagged events into news_review_queue

Algorithms

robust z-score versus cluster
Mahalanobis distance in feature space
correlation breakdown detector

Validation

backtest known event dates around earnings, AI rallies, shocks, and guidance changes
verify that flagged outliers are interpretable, not just noise

Learning goal

See that abnormal behavior is often more actionable than average behavior.

Day 12 — Scenario Models, Baselines First

Goal

Predict ranges of outcomes instead of only point direction.

Deliverables

targets.py
dataset.py
baseline_models.py
quantile_model.py
train.py

Algorithms

conditional forward return distributions
quantile regression for 5, 20, 60-day horizons
random forest or gradient boosting baseline

Validation

compare quantile calibration
check whether p10/p50/p90 bands are sensible during different regimes
benchmark against naive baselines
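Two standard diagnostics cover most of these checks: empirical coverage (how often the realized return fell at or below a predicted quantile) and pinball loss (a score for a single quantile level, lower is better). The sketch below is a plain numpy version with hypothetical function names; it assumes realized and predicted returns are aligned arrays.

```python
import numpy as np


def empirical_coverage(actual, predicted):
    """Share of realized values at or below the predicted quantile.

    For a well-calibrated p10 forecast this should be close to 0.10.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(actual <= predicted))


def pinball_loss(actual, predicted, q):
    """Average quantile (pinball) loss for quantile level q in (0, 1)."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

Computing both per regime label, rather than only over the full sample, shows whether the bands stay sensible when conditions change. A naive baseline for comparison could be the unconditional historical quantile.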

Learning goal

Learn why uncertainty bands are usually more useful than raw directional calls.

Day 13 — GPU Models and Universe Training

Goal

Use the DGX deliberately where it adds value.

Deliverables

GPU-enabled scenario training
train_universe.py
model artifact persistence to model_weights
predict.py and predict_universe.py

Algorithms

PyTorch MLP or temporal model for multi-horizon quantiles
batch inference by ticker universe

Validation

compare GPU model against baselines
log calibration and error by regime
do not keep the GPU model if it does not beat the simpler baselines

Learning goal

Learn when deep models actually add value and when they only add complexity.

Day 14 — Reporting, Blog Output, and Operating Rhythm

Goal

Turn the analytics into a repeatable daily research loop.

Deliverables

summarize.py
reporting container
daily markdown intelligence report
dashboard-ready SQL queries

Daily report should answer

current regime by window
strongest clusters
top positive and negative outliers
scenario ranges for key names and groups
queue items for news review

Learning goal

Build the habit of reading model outputs critically instead of treating them as oracles.

Phase 2 After Day 14

Only after the core pipeline is stable:

add market-news ingestion and event tagging
add embeddings or topic clustering for news explanations
add graph neural or sequence models only if the simpler system proves useful
optionally experiment with CUDA-accelerated feature kernels as a learning track

Docker Compose Design

Proposed New Services

regime-features:
  build: ./analytics/regime
  command: ["python", "run_features.py"]
  profiles: ["regime"]

regime-detect:
  build: ./analytics/regime
  command: ["python", "run_regimes.py"]
  profiles: ["regime"]

regime-cluster:
  build: ./analytics/regime
  command: ["python", "run_clusters.py"]
  profiles: ["regime"]

regime-outliers:
  build: ./analytics/regime
  command: ["python", "run_outliers.py"]
  profiles: ["regime"]

scenario-train:
  build: ./model/outcomes
  command: ["python", "train_universe.py"]
  runtime: nvidia
  profiles: ["outcomes"]

scenario-predict:
  build: ./model/outcomes
  command: ["python", "predict_universe.py"]
  runtime: nvidia
  profiles: ["outcomes"]

report-daily:
  build: ./reporting/market_intelligence
  command: ["python", "compose_summary.py"]
  profiles: ["report"]

Shared Volumes

postgres_data
model_weights
optional report_artifacts

Runtime Plan

ingest profile updates raw prices
regime profile recomputes features, states, clusters, outliers
outcomes profile trains or predicts scenario bands
report profile writes a human-readable summary

Validation Framework

The project should be considered healthy only if it passes these checks.

Data Quality

no missing dates beyond expected market holidays
warmup rows handled consistently
no duplicate keys
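The date and key checks can be prototyped in plain Python before being turned into SQL validation queries. This sketch assumes rows arrive as (ticker, date) tuples and that a holiday calendar is supplied by the caller; it treats every weekday as an expected session unless it is listed in `holidays`. Function names are hypothetical.

```python
from collections import Counter
from datetime import timedelta


def duplicate_keys(rows):
    """rows: iterable of (ticker, date). Returns keys seen more than once."""
    counts = Counter(rows)
    return sorted(key for key, n in counts.items() if n > 1)


def missing_sessions(dates, holidays=frozenset()):
    """Weekdays between the first and last date with no row and no holiday."""
    present = set(dates)
    day, last = min(present), max(present)
    gaps = []
    while day <= last:
        if day.weekday() < 5 and day not in present and day not in holidays:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps
```

Running `missing_sessions` per ticker rather than once for the whole table helps separate market-wide closures from gaps that affect only one name.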

Statistical Quality

regime labels persist long enough to be meaningful
clusters are stable enough to interpret
outlier flags are sparse and explainable
scenario ranges are calibrated

Operational Quality

all jobs are Dockerized
CPU-heavy jobs parallelize safely
GPU jobs are isolated and reproducible
artifacts are versioned and reusable

Should We Use CUDA / C++?

Not for the first implementation.

For this phase:

use Python
use pandas, numpy, scikit-learn, hmmlearn or equivalent
use PyTorch on the DGX only where it helps

Why:

feature engineering correctness matters more than low-level speed right now
the first bottleneck is research design, not kernel performance
CUDA/C++ is better as a focused learning project after the baseline pipeline is proven

Good future CUDA learning targets:

rolling-window indicator kernels
large-scale Monte Carlo simulation
pairwise correlation acceleration

Definition of Success

This phase succeeds when dgx-trading-system can reliably produce:

multi-horizon regime labels
regime-aware correlation groups
meaningful outlier events
range-of-outcomes forecasts
daily summaries that point you toward what deserves news review

If the system does that, it will help with both:

market understanding
model understanding

That combination is the right foundation before building more aggressive ML.
