Day 7 Onward — Market Regime Intelligence Roadmap
Mission
Build a new analytics project inside dgx-trading-system that turns 20 years of
Nasdaq-100 daily OHLCV history into:
- market regime labels across multiple horizons
- groups of correlated stocks that behave similarly inside each regime
- outlier detection for names that detach from their group
- range-of-outcomes forecasts for stocks and clusters
- a workflow for cross-checking statistical anomalies against market news
This roadmap starts at Day 7 and continues forward from the ingestion work
already completed in ingestion/.
The guiding goal is not only to produce forecasts, but also to build a system that helps explain:
- what regime the market is in
- which stocks are moving together
- which names are unusually strong or weak
- what outcomes historically followed similar conditions
Why This Project Shape
The system should separate concerns cleanly:
- ingestion/ loads raw market data and stays unchanged
- analytics/regime/ creates market features, regime labels, clusters, and outlier events
- model/outcomes/ trains forecasting and scenario models
- reporting/ produces summaries, blog-ready outputs, and later LLM/news context
This makes the pipeline modular, easier to test, and easier to reason about.
Target Questions
By the end of this phase, the system should answer:
- What market regime are we currently in over 30/60/90/180/360-day windows?
- Which stocks are tightly linked in the current regime?
- Which stocks are behaving abnormally relative to their peers?
- What forward return ranges historically followed similar setups?
- Which flagged outliers deserve manual or automated news review?
Recommended System Design
Pipeline
market_data_daily
        |
        v
regime feature engineering (CPU, parallel across tickers / windows)
        |
        +--> market-level regime detection (CPU)
        |
        +--> rolling correlation + clustering (CPU)
        |
        +--> outlier detection (CPU)
        |
        v
scenario feature store
        |
        +--> baseline forecasting models (CPU)
        |
        +--> deep / GPU scenario models (NVIDIA)
        |
        v
scenario forecasts + daily intelligence summary
        |
        v
news cross-check / analyst review
Container Roles
- ingestion: CPU, existing, unchanged
- regime-features: CPU, parallel feature computation and DB upserts
- regime-detect: CPU, HMM / change-point / PCA jobs
- regime-cluster: CPU, rolling correlations and clustering jobs
- regime-outliers: CPU, robust outlier detection and event generation
- scenario-train: GPU-enabled, trains probabilistic outcome models
- scenario-predict: GPU-enabled, writes forward scenario ranges
- report-daily: CPU, composes summaries for review and later LLM/news layers
Parallelism Strategy
- Parallelize per-ticker and per-window feature generation on CPU
- Parallelize rolling-correlation jobs by window on CPU
- Keep heavy GPU training controlled and sequential per model family
- Allow batched GPU inference once artifacts are stable
This gives real parallel throughput on the CPU side while keeping GPU jobs from contending with each other.
Proposed New Project Layout
dgx-trading-system/
  analytics/
    regime/
      Dockerfile
      requirements.txt
      tickers.py
      db.py
      windows.py
      features.py
      market_features.py
      regime_hmm.py
      change_points.py
      pca_factors.py
      correlations.py
      clusters.py
      outliers.py
      run_features.py
      run_regimes.py
      run_clusters.py
      run_outliers.py
      summarize.py
      schema.sql
  model/
    outcomes/
      Dockerfile
      requirements.txt
      dataset.py
      targets.py
      baseline_models.py
      quantile_model.py
      montecarlo.py
      train.py
      train_universe.py
      predict.py
      predict_universe.py
      schema.sql
  reporting/
    market_intelligence/
      Dockerfile
      requirements.txt
      compose_summary.py
      render_markdown.py
  docs/
    day-07-regime-intelligence-roadmap.md
    market-regime-intelligence-blog.md
Database Design
These tables keep raw data, derived analytics, forecasts, and review metadata separate. That separation matters for debugging and for understanding how each model behaves.
1. regime_feature_daily
One row per (ticker, date), holding row-level engineered features.
CREATE TABLE IF NOT EXISTS regime_feature_daily (
ticker TEXT NOT NULL,
date DATE NOT NULL,
log_return_1d FLOAT,
log_return_5d FLOAT,
log_return_20d FLOAT,
realized_vol_20d FLOAT,
realized_vol_60d FLOAT,
atr_14 FLOAT,
rsi_14 FLOAT,
macd_hist FLOAT,
bb_pct_b FLOAT,
volume_ratio_20d FLOAT,
hl_range FLOAT,
close_to_open FLOAT,
roc_5 FLOAT,
distance_sma_20 FLOAT,
distance_sma_60 FLOAT,
beta_qqq_60 FLOAT,
corr_qqq_60 FLOAT,
rel_strength_qqq_20 FLOAT,
max_drawdown_60 FLOAT,
updated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (ticker, date)
);
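Because the runner scripts are expected to upsert into this table, re-running a day should overwrite rows rather than duplicate them. The following is a minimal sketch of how such a statement could be built, assuming PostgreSQL (suggested by the TIMESTAMPTZ and JSONB types) and a psycopg-style named-parameter format. The helper name build_upsert_sql is hypothetical and not part of the existing codebase.

```python
def build_upsert_sql(table: str, key_cols: list[str], value_cols: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT DO UPDATE statement for one table.

    Assumption: identifiers come from trusted constants in the code base.
    The isidentifier() check is only a basic guard, not a full SQL sanitizer.
    """
    for name in [table, *key_cols, *value_cols]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    columns = [*key_cols, *value_cols]
    # %(name)s placeholders follow the psycopg named-parameter style (assumed driver).
    placeholders = ", ".join(f"%({col})s" for col in columns)
    # EXCLUDED refers to the row that failed to insert because of the key conflict.
    updates = ", ".join(f"{col} = EXCLUDED.{col}" for col in value_cols)
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_cols)}) DO UPDATE SET {updates}"
    )


sql = build_upsert_sql(
    "regime_feature_daily",
    key_cols=["ticker", "date"],
    value_cols=["log_return_1d", "rsi_14"],
)
```

The conflict target has to match the table's primary key, here (ticker, date), for the statement to be accepted.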
2. market_regime_state
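One row per (window_days, date, model_version) representing the market-level hidden state. Including model_version in the key lets several model versions be compared over the same dates.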
CREATE TABLE IF NOT EXISTS market_regime_state (
window_days INT NOT NULL,
date DATE NOT NULL,
regime_label TEXT NOT NULL,
regime_id INT NOT NULL,
regime_probability FLOAT,
model_name TEXT NOT NULL,
model_version TEXT NOT NULL,
transition_flag BOOLEAN DEFAULT FALSE,
metadata JSONB,
generated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (window_days, date, model_version)
);
3. rolling_relationship_daily
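One row per (date, window_days, ticker_a, ticker_b) for filtered pairwise
relationships. Only store strong or top-k links to avoid table explosion: a universe of about 100 tickers has roughly 4,950 unordered pairs per date and window, which adds up quickly across five windows and 20 years of trading days.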
CREATE TABLE IF NOT EXISTS rolling_relationship_daily (
date DATE NOT NULL,
window_days INT NOT NULL,
ticker_a TEXT NOT NULL,
ticker_b TEXT NOT NULL,
corr_value FLOAT NOT NULL,
beta_ratio FLOAT,
same_cluster BOOLEAN,
generated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (date, window_days, ticker_a, ticker_b)
);
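The top-k filtering described for this table could look roughly like the sketch below. It assumes a square pandas correlation matrix indexed by ticker, ranks links by absolute correlation, and stores each pair once with ticker_a sorted before ticker_b. The function name and both conventions are illustrative assumptions, not decisions the roadmap has made.

```python
import pandas as pd


def top_k_links(corr: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Keep each ticker's k strongest links by |corr|, one row per unordered pair."""
    rows = []
    for ticker in corr.index:
        peers = corr.loc[ticker].drop(ticker)      # exclude the self-correlation
        strongest = peers.abs().nlargest(k).index  # strongest by magnitude
        for peer in strongest:
            a, b = sorted((ticker, peer))          # canonical order: ticker_a < ticker_b
            rows.append((a, b, float(corr.loc[a, b])))
    links = pd.DataFrame(rows, columns=["ticker_a", "ticker_b", "corr_value"])
    # A pair can be selected from both sides; store it only once.
    return links.drop_duplicates(["ticker_a", "ticker_b"]).reset_index(drop=True)
```

With about 100 tickers and k = 5, this keeps at most 500 rows per date and window instead of several thousand.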
4. correlation_cluster_daily
Cluster assignment for each ticker and horizon.
CREATE TABLE IF NOT EXISTS correlation_cluster_daily (
date DATE NOT NULL,
window_days INT NOT NULL,
ticker TEXT NOT NULL,
cluster_id INT NOT NULL,
cluster_method TEXT NOT NULL,
regime_label TEXT,
centrality_score FLOAT,
generated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (date, window_days, ticker, cluster_method)
);
5. outlier_event_daily
Flags unusual names relative to their cluster, regime, or historical profile.
CREATE TABLE IF NOT EXISTS outlier_event_daily (
id BIGSERIAL PRIMARY KEY,
date DATE NOT NULL,
ticker TEXT NOT NULL,
window_days INT NOT NULL,
outlier_type TEXT NOT NULL,
severity_score FLOAT NOT NULL,
cluster_id INT,
regime_label TEXT,
explanation JSONB,
needs_news_review BOOLEAN DEFAULT TRUE,
generated_at TIMESTAMPTZ DEFAULT NOW()
);
6. scenario_forecast_daily
Stores range-of-outcomes forecasts rather than only point predictions.
CREATE TABLE IF NOT EXISTS scenario_forecast_daily (
ticker TEXT NOT NULL,
as_of_date DATE NOT NULL,
horizon_days INT NOT NULL,
regime_label TEXT,
cluster_id INT,
p10_return FLOAT,
p25_return FLOAT,
p50_return FLOAT,
p75_return FLOAT,
p90_return FLOAT,
expected_return FLOAT,
expected_vol FLOAT,
downside_prob FLOAT,
upside_prob FLOAT,
model_name TEXT NOT NULL,
model_version TEXT NOT NULL,
metadata JSONB,
generated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (ticker, as_of_date, horizon_days, model_version)
);
7. news_review_queue
This is the bridge between quant flags and qualitative research.
CREATE TABLE IF NOT EXISTS news_review_queue (
id BIGSERIAL PRIMARY KEY,
date DATE NOT NULL,
ticker TEXT NOT NULL,
source_event_type TEXT NOT NULL,
source_event_id BIGINT,
status TEXT NOT NULL DEFAULT 'PENDING',
analyst_notes TEXT,
tagged_driver TEXT,
reviewed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Algorithms By Layer
Layer 1: Regime Features
Use rolling windows of:
- 30
- 60
- 90
- 180
- 360
Feature families:
- returns and log returns
- realized volatility
- drawdown
- distance from moving averages
- trend persistence
- volume intensity
- beta and correlation to QQQ
- relative strength versus QQQ
- breadth features aggregated at the market level
Primary implementation:
- pandas first, possibly polars later if CPU becomes a bottleneck
- pure functions, no side effects except DB upserts in runner scripts
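As an illustration of the pure-function style, here is a small pandas sketch for a few columns from regime_feature_daily. It assumes a per-ticker frame with a close column sorted by date. The annualization factor of 252 and the exact definitions are assumptions; the roadmap does not fix them.

```python
import numpy as np
import pandas as pd


def log_return(close: pd.Series, periods: int = 1) -> pd.Series:
    """Log return over `periods` rows, using only current and past closes."""
    return np.log(close / close.shift(periods))


def realized_vol(close: pd.Series, window: int = 20) -> pd.Series:
    """Rolling std of 1-day log returns, annualized with 252 days (assumption)."""
    daily = log_return(close, 1)
    return daily.rolling(window, min_periods=window).std() * np.sqrt(252)


def distance_sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Fractional distance of the close from its simple moving average."""
    sma = close.rolling(window, min_periods=window).mean()
    return close / sma - 1.0


def build_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Return a new feature frame for one ticker; the input is not modified."""
    close = prices["close"]
    return pd.DataFrame(
        {
            "log_return_1d": log_return(close, 1),
            "log_return_5d": log_return(close, 5),
            "realized_vol_20d": realized_vol(close, 20),
            "distance_sma_20": distance_sma(close, 20),
        },
        index=prices.index,
    )
```

Setting min_periods equal to the window leaves warmup rows as NaN instead of filling them with estimates from short samples, which makes the warmup handling explicit.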
Layer 2: Regime Detection
Use a combination of methods, not a single algorithm.
Primary:
- Hidden Markov Model on market-level features
Secondary:
- Bayesian or offline change-point detection for transition dates
- Rolling PCA / factor decomposition for market structure diagnostics
Why this combination:
- HMM labels persistent hidden states
- change-point detection confirms structural breaks
- PCA shows whether leadership is broad or concentrated
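One possible PCA diagnostic is the share of variance explained by the first principal component of the cross-sectional correlation matrix. The sketch below assumes a returns frame with one column per ticker and no missing values. Using the top-eigenvalue share is an illustrative choice rather than a method the roadmap specifies.

```python
import numpy as np
import pandas as pd


def top_factor_share(returns: pd.DataFrame) -> float:
    """Share of total variance explained by the first principal component.

    PCA is run on the correlation matrix, which is equivalent to PCA on
    standardized returns.
    """
    corr = np.corrcoef(returns.to_numpy(), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return float(eigenvalues[-1] / eigenvalues.sum())


def rolling_top_factor_share(returns: pd.DataFrame, window: int) -> pd.Series:
    """Top-factor share over trailing windows ending at each date."""
    out = pd.Series(np.nan, index=returns.index, name=f"pc1_share_{window}")
    for end in range(window, len(returns) + 1):
        out.iloc[end - 1] = top_factor_share(returns.iloc[end - window:end])
    return out
```

A share close to 1 suggests most names are moving on a single common factor, while a low share suggests more idiosyncratic behavior. How to map that onto regime labels is left open here.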
Layer 3: Correlated Groups
Primary:
- rolling correlation matrix
- hierarchical clustering using correlation distance
Secondary:
- graph-based community detection later if needed
Distance:
distance(a, b) = sqrt(0.5 * (1 - corr(a, b)))
Store only:
- top-k strongest links per ticker
- cluster assignment
- cluster stability metrics
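The distance formula above maps a correlation of +1 to distance 0 and a correlation of -1 to distance 1. A minimal sketch of feeding it into hierarchical clustering follows, assuming SciPy is available and a returns frame with one column per ticker. Average linkage and a fixed cluster count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def correlation_distance(corr: pd.DataFrame) -> pd.DataFrame:
    """distance(a, b) = sqrt(0.5 * (1 - corr(a, b))), bounded in [0, 1]."""
    # Clip guards against tiny negative values caused by floating-point error.
    dist = np.sqrt(np.clip(0.5 * (1.0 - corr.to_numpy()), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    return pd.DataFrame(dist, index=corr.index, columns=corr.columns)


def cluster_tickers(returns: pd.DataFrame, n_clusters: int) -> pd.Series:
    """Assign each ticker a cluster id from average-linkage clustering."""
    corr = returns.corr()
    dist = correlation_distance(corr)
    # linkage expects the condensed (upper-triangle) form of the distance matrix.
    condensed = squareform(dist.to_numpy(), checks=False)
    tree = linkage(condensed, method="average")
    ids = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(ids, index=corr.index, name="cluster_id")
```

Cluster ids produced this way are arbitrary integers, so ids from different dates or windows are not directly comparable without a matching step.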
Layer 4: Outlier Detection
Use multiple outlier types.
- Return outliers
  - robust z-score versus cluster peers
- Correlation outliers
  - ticker detaches from normal cluster neighbors
- Feature-space outliers
  - Mahalanobis distance
  - Isolation Forest later if needed
Primary v1 choice:
- robust z-score + Mahalanobis distance
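The two v1 detectors can be sketched with NumPy alone. The version below assumes the robust z-score is computed across the members of one cluster on one date, and that the Mahalanobis distance is measured from the sample mean of the same feature matrix. Both are simplifying assumptions; estimating the mean and covariance from a history window would be a reasonable alternative.

```python
import numpy as np


def robust_zscore(values) -> np.ndarray:
    """Median/MAD z-score; 1.4826 rescales MAD to match std under normality."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return np.zeros_like(values)  # no dispersion, so nothing can be flagged
    return (values - median) / (1.4826 * mad)


def mahalanobis_distance(features) -> np.ndarray:
    """Distance of each row from the sample mean under the sample covariance."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates collinear features
    squared = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.clip(squared, 0.0, None))
```

The median and MAD are used because a single extreme return barely moves them, whereas it would inflate an ordinary mean and standard deviation and hide itself.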
Layer 5: Outcome Modeling
This layer should start simple before going deep.
Baseline models first:
- conditional historical forward-return tables
- quantile regression
- random forest / gradient boosting baseline
GPU-enhanced models second:
- PyTorch MLP or sequence model for multi-horizon quantile prediction
- optional temporal model once the baseline is proven useful
The main objective is not a single directional label. It is a conditional range:
- expected return
- tail loss
- upside potential
- regime-conditioned uncertainty
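The simplest baseline, a conditional historical forward-return table, can be sketched as below. It assumes a close series and a regime label series that share a date index; the output column names mirror scenario_forecast_daily. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd

QUANTILES = {
    "p10_return": 0.10,
    "p25_return": 0.25,
    "p50_return": 0.50,
    "p75_return": 0.75,
    "p90_return": 0.90,
}


def forward_log_return(close: pd.Series, horizon_days: int) -> pd.Series:
    """Log return from date t to t + horizon_days; a target, never a feature."""
    return np.log(close.shift(-horizon_days) / close)


def conditional_return_table(
    close: pd.Series, regime: pd.Series, horizon_days: int
) -> pd.DataFrame:
    """Forward-return quantiles and downside frequency per regime label."""
    frame = pd.DataFrame(
        {"regime_label": regime, "fwd": forward_log_return(close, horizon_days)}
    ).dropna()  # the last horizon_days rows have no realized outcome yet
    grouped = frame.groupby("regime_label")["fwd"]
    table = pd.DataFrame({name: grouped.quantile(q) for name, q in QUANTILES.items()})
    table["downside_prob"] = frame["fwd"].lt(0).groupby(frame["regime_label"]).mean()
    table["n_obs"] = grouped.size()
    return table
```

Daily observations of a multi-day forward return overlap heavily, so n_obs overstates the number of independent samples behind each quantile.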
Layer 6: News Cross-Check
Do not let news drive the first-stage signal.
Instead:
- detect anomaly quantitatively
- enqueue it for review
- attach cause tags later
- eventually train on those tags as structured metadata
This keeps the quantitative system honest and explainable.
Day-By-Day Plan
Day 7 — Architecture and Schema Freeze
Goal
Lock the project shape before code spreads in too many directions.
Deliverables
- create analytics/regime/ and model/outcomes/ module skeletons
- write schema files for analytics and outcomes tables
- update docker-compose.yml with new profiles and service stubs
- write a short runbook for the new pipeline
What to implement
- analytics/regime/schema.sql
- model/outcomes/schema.sql
- base Dockerfiles and pinned requirements
- shared DB helpers
Learning goal
Understand why raw data, derived features, hidden states, outlier events, and forecast outputs should live in different tables.
Day 8 — Feature Store and Data QA
Goal
Build a trustworthy feature layer from market_data_daily.
Deliverables
- features.py
- market_features.py
- run_features.py
- validation queries and row-count checks
Algorithms
- rolling returns
- volatility
- ATR
- RSI
- MACD histogram
- Bollinger %B
- beta/correlation to QQQ
- relative strength
Validation
- compare features for 3 known tickers manually
- ensure no future leakage
- confirm warmup periods are handled correctly
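One way to test for future leakage is to recompute a feature on history truncated at a cutoff date and confirm that the value at the cutoff does not change. The sketch below assumes each feature is a function from a close series to a feature series. rolling_vol is only a stand-in example feature, and leaks_future is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def rolling_vol(close: pd.Series, window: int = 20) -> pd.Series:
    """Example feature: rolling std of 1-day log returns."""
    return np.log(close / close.shift(1)).rolling(window, min_periods=window).std()


def leaks_future(feature_fn, close: pd.Series, cutoff: int) -> bool:
    """True if the feature at position `cutoff` changes once later rows are removed."""
    full = feature_fn(close).iloc[cutoff]
    truncated = feature_fn(close.iloc[: cutoff + 1]).iloc[-1]
    if pd.isna(full) and pd.isna(truncated):
        return False  # consistently undefined, for example inside the warmup period
    return not np.isclose(full, truncated)
```

A feature built with a negative shift, such as a forward return, fails this check because its value at the cutoff depends on a row that the truncated history does not contain.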
Learning goal
See how almost all later models depend on feature correctness more than model complexity.
Day 9 — Regime Detection Engine
Goal
Label market states over 30/60/90/180/360-day windows.
Deliverables
- regime_hmm.py
- change_points.py
- pca_factors.py
- run_regimes.py
Algorithms
- Gaussian HMM on market-level features
- change-point detection on QQQ and breadth features
- rolling PCA on standardized cross-sectional returns
Validation
- inspect regime labels during 2008, 2020, 2022, 2023, 2024, 2025
- ensure state counts are not degenerate
- verify transition flags cluster near real structural breaks
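The degeneracy and persistence checks can be approximated from the label series alone. The sketch below assumes one regime label per date with no missing values. The min_share threshold is a hypothetical parameter, not a value the roadmap defines.

```python
import pandas as pd


def regime_run_lengths(labels: pd.Series) -> pd.DataFrame:
    """One row per uninterrupted run of the same regime label."""
    run_id = (labels != labels.shift()).cumsum()  # increments at every label change
    runs = labels.groupby(run_id).agg(["first", "count"])
    runs.columns = ["regime_label", "length_days"]
    return runs.reset_index(drop=True)


def regime_diagnostics(labels: pd.Series, min_share: float = 0.05) -> dict:
    """Occupancy share, median run length, and rarely visited states."""
    share = labels.value_counts(normalize=True)
    runs = regime_run_lengths(labels)
    return {
        "occupancy": share.to_dict(),
        "median_run_days": float(runs["length_days"].median()),
        "degenerate_states": sorted(share[share < min_share].index),
    }
```

A very short median run length suggests the model is flickering between states, while a state with near-zero occupancy suggests the chosen number of states is too high for the data.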
Learning goal
Learn the difference between persistent hidden states and abrupt structural breaks.
Day 10 — Correlation Clusters and Market Topology
Goal
Find groups of stocks that move together inside each regime.
Deliverables
- correlations.py
- clusters.py
- run_clusters.py
Algorithms
- rolling correlation matrices
- hierarchical clustering
- cluster stability across windows
- simple network statistics such as degree and centrality
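Cluster stability across windows needs a measure that ignores arbitrary cluster ids. One candidate, shown here as an assumption rather than a chosen metric, is the Rand index: the share of ticker pairs that two clusterings treat the same way (both together or both apart).

```python
import numpy as np


def co_membership_agreement(labels_a, labels_b) -> float:
    """Rand index between two cluster assignments over the same tickers."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("both label arrays must cover the same tickers")
    same_a = a[:, None] == a[None, :]  # True where a pair shares a cluster in A
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))
```

For example, comparing the 60-day and 90-day assignments for the same date gives 1.0 when the groupings are identical up to relabeling. The plain Rand index is not corrected for chance, so an adjusted variant may be preferable once scikit-learn is in the requirements.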
Validation
- confirm semis, software, mega-cap, consumer, and biotech names form sensible groups
- compare cluster maps across bullish and defensive regimes
Learning goal
Understand that regime shifts often change the market graph before they show up in simple return summaries.
Day 11 — Outlier Engine and Review Queue
Goal
Detect names that detach from their peers and deserve attention.
Deliverables
- outliers.py
- run_outliers.py
- summary generation into news_review_queue
Algorithms
- robust z-score versus cluster
- Mahalanobis distance in feature space
- correlation breakdown detector
Validation
- backtest known event dates around earnings, AI rallies, shocks, and guidance changes
- verify that flagged outliers are interpretable, not just noise
Learning goal
See that abnormal behavior is often more actionable than average behavior.
Day 12 — Scenario Models, Baselines First
Goal
Predict ranges of outcomes instead of only point direction.
Deliverables
- targets.py
- dataset.py
- baseline_models.py
- quantile_model.py
- train.py
Algorithms
- conditional forward return distributions
- quantile regression for 5, 20, 60-day horizons
- random forest or gradient boosting baseline
Validation
- compare quantile calibration
- check whether p10/p50/p90 bands are sensible during different regimes
- benchmark against naive baselines
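Quantile calibration can be checked with two small metrics: the pinball (quantile) loss for each predicted quantile, and the empirical coverage of a band such as p10 to p90, which should be close to 0.80 if the band is calibrated. This is a minimal NumPy sketch with hypothetical function names.

```python
import numpy as np


def pinball_loss(y_true, y_pred, quantile: float) -> float:
    """Mean pinball loss of a predicted quantile; lower is better."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))


def empirical_coverage(y_true, lower, upper) -> float:
    """Share of realized values that fall inside [lower, upper]."""
    y = np.asarray(y_true, dtype=float)
    inside = (y >= np.asarray(lower, dtype=float)) & (y <= np.asarray(upper, dtype=float))
    return float(np.mean(inside))
```

Computing both metrics separately per regime label shows whether the bands stay calibrated in stressed periods or only on average.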
Learning goal
Learn why uncertainty bands are usually more useful than raw directional calls.
Day 13 — GPU Models and Universe Training
Goal
Use the DGX deliberately where it adds value.
Deliverables
- GPU-enabled scenario training
- train_universe.py
- model artifact persistence to model_weights
- predict.py and predict_universe.py
Algorithms
- PyTorch MLP or temporal model for multi-horizon quantiles
- batch inference by ticker universe
Validation
- compare GPU model against baselines
- log calibration and error by regime
- do not keep the GPU model if it does not beat the simpler baselines
Learning goal
Learn when deep models actually add value and when they only add complexity.
Day 14 — Reporting, Blog Output, and Operating Rhythm
Goal
Turn the analytics into a repeatable daily research loop.
Deliverables
- summarize.py
- reporting container
- daily markdown intelligence report
- dashboard-ready SQL queries
Daily report should answer
- current regime by window
- strongest clusters
- top positive and negative outliers
- scenario ranges for key names and groups
- queue items for news review
Learning goal
Build the habit of reading model outputs critically instead of treating them as oracles.
Phase 2 After Day 14
Only after the core pipeline is stable:
- add market-news ingestion and event tagging
- add embeddings or topic clustering for news explanations
- add graph neural or sequence models only if the simpler system proves useful
- optionally experiment with CUDA-accelerated feature kernels as a learning track
Docker Compose Design
Proposed New Services
regime-features:
  build: ./analytics/regime
  command: ["python", "run_features.py"]
  profiles: ["regime"]

regime-detect:
  build: ./analytics/regime
  command: ["python", "run_regimes.py"]
  profiles: ["regime"]

regime-cluster:
  build: ./analytics/regime
  command: ["python", "run_clusters.py"]
  profiles: ["regime"]

regime-outliers:
  build: ./analytics/regime
  command: ["python", "run_outliers.py"]
  profiles: ["regime"]

scenario-train:
  build: ./model/outcomes
  command: ["python", "train_universe.py"]
  runtime: nvidia
  profiles: ["outcomes"]

scenario-predict:
  build: ./model/outcomes
  command: ["python", "predict_universe.py"]
  runtime: nvidia
  profiles: ["outcomes"]

report-daily:
  build: ./reporting/market_intelligence
  command: ["python", "compose_summary.py"]
  profiles: ["report"]
Shared Volumes
- postgres_data
- model_weights
- optional report_artifacts
Runtime Plan
- ingest profile updates raw prices
- regime profile recomputes features, states, clusters, outliers
- outcomes profile trains or predicts scenario bands
- report profile writes a human-readable summary
Validation Framework
The project should be considered healthy only if it passes these checks.
Data Quality
- no missing dates beyond expected market holidays
- warmup rows handled consistently
- no duplicate keys
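The duplicate-key and missing-date checks could be sketched in pandas as follows. The holiday list is assumed to be supplied by the caller; the sketch does not try to derive an exchange calendar, and the helper names are hypothetical.

```python
import pandas as pd


def duplicate_keys(frame: pd.DataFrame, keys=("ticker", "date")) -> pd.DataFrame:
    """Return every row whose key columns appear more than once."""
    return frame[frame.duplicated(list(keys), keep=False)]


def missing_sessions(dates, holidays=()) -> pd.DatetimeIndex:
    """Weekdays between the first and last date that are neither present nor holidays."""
    present = pd.DatetimeIndex(pd.to_datetime(dates)).normalize().unique()
    expected = pd.bdate_range(present.min(), present.max())  # Monday to Friday
    expected = expected.difference(pd.DatetimeIndex(pd.to_datetime(list(holidays))))
    return expected.difference(present)
```

Running missing_sessions per ticker, rather than over the whole table, also catches names with gaps from halts or late listings.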
Statistical Quality
- regime labels persist long enough to be meaningful
- clusters are stable enough to interpret
- outlier flags are sparse and explainable
- scenario ranges are calibrated
Operational Quality
- all jobs are Dockerized
- CPU-heavy jobs parallelize safely
- GPU jobs are isolated and reproducible
- artifacts are versioned and reusable
Should We Use CUDA / C++?
Not for the first implementation.
For this phase:
- use Python
- use pandas, numpy, scikit-learn, hmmlearn or equivalent
- use PyTorch on the DGX only where it helps
Why:
- feature engineering correctness matters more than low-level speed right now
- the first bottleneck is research design, not kernel performance
- CUDA/C++ is better as a focused learning project after the baseline pipeline is proven
Good future CUDA learning targets:
- rolling-window indicator kernels
- large-scale Monte Carlo simulation
- pairwise correlation acceleration
Definition of Success
This phase succeeds when dgx-trading-system can reliably produce:
- multi-horizon regime labels
- regime-aware correlation groups
- meaningful outlier events
- range-of-outcomes forecasts
- daily summaries that point you toward what deserves news review
If the system does that, it will help with both:
- market understanding
- model understanding
That combination is the right foundation before building more aggressive ML.