Portfolio Stories — DGX Trading System
Written 2026-05-12. Five sub-project stories plus the platform story. Format: (1) One-Line Version, (2) Situation, (3) Technical, (4) Leadership, (5) Outcome, (6) How I Work, (7) Technologies, (8) Relevant For.
Story 1 — TQQQ Signal (Random Forest Directional Signal)
1. One-Line Version
Built a GPU-accelerated 3-class directional signal for TQQQ using a Random Forest with a walk-forward backtest and block bootstrap Monte Carlo — then documented honestly that it underperformed buy-and-hold, and used that result to improve the methodology.
2. The Situation
TQQQ is a 3x leveraged ETF tracking the Nasdaq 100. Its volatility profile is different from every individual equity in the universe — it decays in sideways markets, compounds aggressively in trending ones, and reacts to VXN (Nasdaq volatility index) in ways that don't show up in standard price features. A directional signal on TQQQ is genuinely hard because the instrument punishes being wrong in a way that a 1x ETF does not.
The goal was to build a signal that could distinguish BUY, SHORT, and HOLD over a 5-day horizon — not just predict direction, but express confidence levels and have those confidence levels mean something in position sizing.
3. What I Did (Technical)
Built a three-class Random Forest trained on a hand-constructed feature set: anchored VWAP, VXN implied volatility features, dynamic volatility-scaled labels (so the BUY/SHORT/HOLD thresholds tighten in low-volatility regimes and widen in high-volatility ones), and three named feature sets (current, strict_c, prag_c) allowing controlled ablation across feature groups.
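To make the volatility-scaled labels concrete, here is a minimal sketch, not the project's labeling code. It assumes the threshold is a multiple `k` of trailing daily volatility, scaled to the 5-day horizon by the square root of time; the function name, `k`, the lookback, and the scaling rule are all illustrative assumptions.

```python
import math
import statistics

def label_forward_return(fwd_return, trailing_daily_returns, horizon_days=5, k=0.5):
    """Return 'BUY', 'SHORT', or 'HOLD' for one observation.

    k and the sqrt-of-time scaling are illustrative assumptions,
    not values taken from the project.
    """
    daily_vol = statistics.pstdev(trailing_daily_returns)
    threshold = k * daily_vol * math.sqrt(horizon_days)
    if fwd_return > threshold:
        return "BUY"
    if fwd_return < -threshold:
        return "SHORT"
    return "HOLD"

# The same +2% forward return is labeled differently in the two regimes.
calm = [0.002, -0.002] * 10      # low trailing volatility, tight band
stressed = [0.03, -0.03] * 10    # high trailing volatility, wide band
print(label_forward_return(0.02, calm))      # BUY
print(label_forward_return(0.02, stressed))  # HOLD
```

The point of the design is visible in the last two lines: a move that clears the band in a calm regime is treated as noise in a volatile one.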
Trained on a 75/25 chronological split with GPU acceleration via cuML (sklearn CPU fallback for non-GPU environments). Outputs are upserted to tqqq_signals in PostgreSQL.
The backtest is a walk-forward expanding-window design — the model never sees future data at any training step, and execution is modeled as next-day fill rather than same-day. Position sizing is probability-weighted: higher RF confidence in a direction translates to larger position, with an anchored VWAP circuit breaker that reduces exposure in adverse conditions.
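The three mechanics in that paragraph (expanding training window, probability-weighted sizing, next-day fill) can be sketched as below. This is an illustration under stated assumptions: the function names, the `breaker_scale` haircut, and the rule "halve long exposure when price is below the anchored VWAP" are hypothetical stand-ins for the real circuit breaker.

```python
def expanding_window_splits(n_obs, min_train, step):
    """Yield (train_indices, test_indices); training always ends before the test block starts."""
    start = min_train
    while start < n_obs:
        stop = min(start + step, n_obs)
        yield range(0, start), range(start, stop)
        start = stop

def position_size(p_buy, p_short, below_anchored_vwap, breaker_scale=0.5):
    """Signed exposure in [-1, 1], driven by the gap between BUY and SHORT probabilities.

    breaker_scale is a hypothetical haircut on long exposure when price
    sits below the anchored VWAP.
    """
    exposure = p_buy - p_short
    if below_anchored_vwap and exposure > 0:
        exposure *= breaker_scale
    return exposure

def next_day_fill_pnl(exposures, daily_returns):
    """A position decided on day t earns the return of day t+1, never day t."""
    return [exposures[t] * daily_returns[t + 1] for t in range(len(exposures) - 1)]

for train_idx, test_idx in expanding_window_splits(n_obs=10, min_train=4, step=3):
    print(len(train_idx), list(test_idx))
```

Shifting returns by one day in `next_day_fill_pnl` is what keeps the backtest from crediting a signal with the move that produced it.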
Monte Carlo: 1,000 paths, moving-block bootstrap (block_size=3 to preserve autocorrelation), outputting p5/p25/p50/p75/p95 return distributions and probability-of-profit stored as JSONB in PostgreSQL. The block size of 3 was deliberate: daily returns are autocorrelated, especially in a leveraged ETF, so resampling single days independently would break that dependence and understate tail risk.
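A moving-block bootstrap along these lines might look like the sketch below. It uses only the standard library and assumes the input is a list of per-period strategy returns at least `block_size` long; the helper names and the simple index-based percentile are illustrative choices, not the project's code.

```python
import random

def moving_block_bootstrap(returns, block_size=3, rng=None):
    """Build one resampled path by concatenating overlapping blocks drawn with replacement.

    Assumes len(returns) >= block_size.
    """
    rng = rng or random.Random()
    n = len(returns)
    last_start = n - block_size
    path = []
    while len(path) < n:
        start = rng.randint(0, last_start)  # inclusive on both ends
        path.extend(returns[start:start + block_size])
    return path[:n]

def compound(path):
    total = 1.0
    for r in path:
        total *= 1.0 + r
    return total - 1.0

def percentile(sorted_values, q):
    """Simple index-based percentile on pre-sorted data, q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[idx]

def monte_carlo_summary(returns, n_paths=1000, block_size=3, seed=0):
    rng = random.Random(seed)
    totals = sorted(
        compound(moving_block_bootstrap(returns, block_size, rng))
        for _ in range(n_paths)
    )
    summary = {f"p{q}": percentile(totals, q) for q in (5, 25, 50, 75, 95)}
    summary["prob_profit"] = sum(t > 0 for t in totals) / n_paths
    return summary
```

The returned dictionary has the same shape as the summary described above (p5 through p95 plus probability of profit), so it could be serialized to JSON as-is.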
4. What I Did (Leadership)
The backtest came back negative: -4.9% return against 8,956% buy-and-hold over the same period, 52.7% win rate, Sharpe -0.03, Monte Carlo median -3.9%, 42.2% probability of profit.
I documented these results exactly as they came out and didn't adjust the methodology to improve the reported numbers. The decision was to treat the backtest as a diagnostic, not a marketing exercise. What the result told us was that the feature set needed work — the VXN features and the dynamic label thresholds were pointing in the right direction, but the anchored VWAP circuit breaker was cutting winners too early in trending regimes.
That's a useful finding. It's also the kind of finding that only shows up if you build the backtest honestly. A lot of ML signal work quietly adjusts the backtest window or the evaluation metric until the results look acceptable. This one didn't.
5. The Outcome (with numbers)
- 3-class signal (BUY/SHORT/HOLD) with probability-weighted position sizing
- Walk-forward expanding-window backtest — no future data leakage at any training step
- 1,000-path Monte Carlo with p5/p50/p95 distributions and probability-of-profit persisted to PostgreSQL
- 52.7% win rate, -4.9% return vs 8,956% buy-and-hold — documented honestly, results used to direct next feature iteration
- GPU training via cuML on the DGX Blackwell, with automatic fallback to sklearn CPU
- Monte Carlo median of -3.9% and 42.2% probability of profit surfaced that the strategy had structural issues worth fixing, not just noise
6. What This Shows About How I Work
I don't optimize for impressive results — I optimize for results you can trust. A backtest that shows underperformance is more valuable than one that shows outperformance you can't reproduce, because it tells you where to direct effort next. Building the honest result into the dashboard (visible to anyone running the system) was a deliberate choice to keep the team calibrated on what the model actually does versus what we hope it does.
7. Technologies Used
- cuML (RAPIDS) — GPU Random Forest training on NVIDIA Blackwell
- scikit-learn — CPU fallback for RF training
- PostgreSQL 15 — signal storage (tqqq_signals), Monte Carlo output (JSONB)
- pandas / numpy — feature engineering
- Docker — isolated training container with --gpus all
- Python — walk-forward backtest engine, block bootstrap Monte Carlo
8. Relevant For
- Quantitative research / signal development roles
- ML engineering roles with financial domain context
- Platform roles where GPU acceleration needs to be practical, not just present
- Any role where "we built it and it didn't work" is a valued answer
Story 2 — LDTM (Per-Ticker LSTM Prediction System)
1. One-Line Version
Trained a per-ticker LSTM on 103 Nasdaq 100 equities in parallel on a single DGX GPU, with a multi-fold model promotion gate, snapshot audit trail, and 30-day rolling accuracy leaderboard — turning a research model into an always-on production inference system.
2. The Situation
The goal was per-ticker price prediction across the full NDX 100 universe: every ticker gets its own trained model, runs inference nightly, and writes predictions that can be evaluated against actuals when they arrive. That sounds straightforward until you try to do it at scale on one machine.
103 tickers × one LSTM each = 103 training runs. On a CPU, that's hours. On a single GPU without parallelism, each run serializes the GPU and the wall clock grows linearly. The infrastructure question was: how do you run 103 models on one GPU efficiently, promote the good ones, retire the bad ones, and maintain an audit trail of every prediction against every actual close — without building a distributed training cluster?
3. What I Did (Technical)
Architecture: 2-layer LSTM, hidden=128, 30-day rolling window, 11 input features (log returns at multiple horizons, volume z-score, MA ratios, ATR, IV/HV from options), 3 output horizons (1d, 5d, 20d). 227K parameters per model. Per-window normalization: each 30-day window is z-scored independently so the model sees relative patterns, not absolute price levels.
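Per-window normalization can be sketched in a few lines. This is a plain-Python illustration, assuming each window is a list of rows with one value per feature; the real pipeline presumably works on arrays or tensors, and the fallback of 1.0 for a constant column is an assumption added here to avoid dividing by zero.

```python
import statistics

def zscore_window(window):
    """Z-score one window feature by feature, using only that window's own statistics."""
    n_features = len(window[0])
    columns = [[row[j] for row in window] for j in range(n_features)]
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 so it maps to all zeros.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(n_features)]
        for row in window
    ]

def rolling_windows(rows, length=30):
    """Return every length-row window, each normalized independently."""
    return [zscore_window(rows[i:i + length]) for i in range(len(rows) - length + 1)]
```

Because the mean and standard deviation come from the window itself, no statistic from outside the window leaks into the model input.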
Parallel orchestration: A GPU-aware orchestrator uses nvidia-smi to detect available VRAM, divides it into 600 MiB slots, and fills the GPU with ThreadPoolExecutor workers in two waves: train all, then infer all. Result: 90% GPU utilization, 310 MiB per container, 52 seconds average per ticker, and a 20-minute wall clock for all 103.
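The slot calculation and two-wave pattern can be sketched as follows. This is a simplified illustration: `train` and `infer` are hypothetical callables standing in for whatever launches each per-ticker container, and free memory is passed in as text so the sketch runs without a GPU. The query shown in the docstring is a standard nvidia-smi invocation; how the real orchestrator calls it is not stated in the source.

```python
from concurrent.futures import ThreadPoolExecutor

SLOT_MIB = 600  # slot size quoted above

def parse_free_mib(nvidia_smi_csv):
    """Parse the first GPU's value from:
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    """
    return int(nvidia_smi_csv.strip().splitlines()[0])

def slot_count(free_mib, slot_mib=SLOT_MIB):
    """Number of concurrent workers that fit, never fewer than one."""
    return max(1, free_mib // slot_mib)

def run_wave(task, tickers, workers):
    """Run one wave to completion before the caller starts the next."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(tickers, pool.map(task, tickers)))

def run_two_waves(tickers, train, infer, free_mib):
    workers = slot_count(free_mib)
    trained = run_wave(train, tickers, workers)     # wave 1: train all
    predictions = run_wave(infer, tickers, workers)  # wave 2: infer all
    return trained, predictions
```

Sizing slots at 600 MiB while containers use about 310 MiB leaves headroom per worker, which is one plausible reason the slot is larger than the observed footprint.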
Prediction audit: Snapshot writer upserts inference results to ldtm_daily_snapshots before market close each day. Fillback worker runs after close, pulling actuals from market_data_daily and writing actual_close, pct_error, and direction_correct back to the same rows. This means every prediction is permanently linked to its outcome — no separate reconciliation table, no manual joins.
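The fillback step amounts to completing a row that already holds the prediction. The sketch below shows the idea on a plain dictionary rather than a PostgreSQL row. The field names `predicted_close` and `prior_close`, and the exact definitions of `pct_error` and `direction_correct`, are assumptions; the source names the output columns but does not define them.

```python
def fillback(snapshot, actual_close):
    """Return a copy of a snapshot row with its outcome fields filled in.

    Assumed definitions (not confirmed by the source):
      pct_error         = (predicted_close - actual_close) / actual_close * 100
      direction_correct = predicted and actual moves from prior_close agree on
                          "up" versus "not up" (a flat move counts as not up)
    """
    predicted_move = snapshot["predicted_close"] - snapshot["prior_close"]
    actual_move = actual_close - snapshot["prior_close"]
    filled = dict(snapshot)  # leave the caller's row untouched
    filled["actual_close"] = actual_close
    filled["pct_error"] = (snapshot["predicted_close"] - actual_close) / actual_close * 100.0
    filled["direction_correct"] = (predicted_move > 0) == (actual_move > 0)
    return filled

row = {"ticker": "NVDA", "predicted_close": 102.0, "prior_close": 100.0}
print(fillback(row, 104.0)["direction_correct"])  # True
print(fillback(row, 98.0)["direction_correct"])   # False
```

Keeping prediction and outcome in one record is what removes the need for a separate reconciliation join.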
Promotion gate: Multi-fold backtest required before any model moves from candidate to active: direction accuracy ≥ 52%, average error ≤ 5%, no data leakage, passing in ≥ 2 of N folds. Failed models stay candidate. Active models that degrade get retired. Version format: v{major}.{trained_on_date}.
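Expressed as code, the gate is a small pure function. The sketch below uses the thresholds quoted above; the `FoldResult` type and function names are hypothetical, and leakage is treated as a hard fail across all folds, which is one reading of "no data leakage".

```python
from dataclasses import dataclass

@dataclass
class FoldResult:
    direction_accuracy: float  # fraction in [0, 1]
    avg_pct_error: float       # percent
    leakage_detected: bool

def fold_passes(fold, min_accuracy=0.52, max_error=5.0):
    return fold.direction_accuracy >= min_accuracy and fold.avg_pct_error <= max_error

def promotion_decision(folds, min_passing_folds=2):
    """Return 'active' or 'candidate'. Any leakage blocks promotion regardless of accuracy."""
    if any(f.leakage_detected for f in folds):
        return "candidate"
    passing = sum(fold_passes(f) for f in folds)
    return "active" if passing >= min_passing_folds else "candidate"

folds = [
    FoldResult(0.55, 3.1, False),
    FoldResult(0.49, 4.0, False),
    FoldResult(0.53, 4.8, False),
]
print(promotion_decision(folds))  # active
```

Checking leakage before counting folds keeps a leaky model from being promoted on the strength of inflated accuracy.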
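Inference speed: 142ms per ticker. Because tickers are inferred concurrently in the inference wave, the full universe of 103 finishes in under 2 seconds; run back to back, 103 × 142ms would take about 14.6 seconds.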
4. What I Did (Leadership)
The promotion gate was the hardest design decision. The obvious path was to train models and deploy them — the accuracy numbers looked reasonable and the infrastructure was working. The case for the promotion gate was that "looks reasonable" isn't a production standard. If a model's win rate is 49% and it's calling direction on real capital decisions, that's a liability, not a feature.
Setting 52% as the threshold was a deliberate choice: it's above random (50%) but achievable on a real financial time series without overfitting. Setting the leakage check as a hard gate, not a soft warning, was about ensuring that good-looking accuracy numbers couldn't be the product of inadvertent look-ahead. If a model passes the leakage check and passes two folds at 52%, you've earned the right to call it active.
The fillback audit pattern came from a different concern: how do you know if your models are degrading in production? Without linking predictions to actuals in the same row, you're comparing two separate tables and hoping the join is right. The snapshot + fillback design makes the audit trail append-only and self-contained.
5. The Outcome (with numbers)
- 103 per-ticker LSTMs trained and run in parallel on a single DGX Blackwell
- 20-minute wall clock for full universe training (down from estimated 5+ hours sequential)
- 90% GPU utilization, 310 MiB/container, 52s avg/ticker training
- 142ms/ticker inference, under 2 seconds for all 103 tickers
- Promotion gate with multi-fold backtest, leakage check, ≥52% direction accuracy requirement
- First production cohort: 68 bullish / 26 neutral / 9 bearish signals across the universe
- Full prediction audit trail — every prediction linked to actual close in ldtm_daily_snapshots
- 30-day rolling accuracy leaderboard visible in the dashboard, surfacing model degradation without manual SQL queries
6. What This Shows About How I Work
I think about the lifecycle of a model, not just its training accuracy. Building the promotion gate, the fillback audit, and the accuracy leaderboard was about answering the question: "how will we know if this stops working?" That question is usually an afterthought. Here it was designed in from the start.
The parallel orchestrator is the same principle applied to infrastructure: the goal isn't to train 103 models, it's to train 103 models in a timeframe that makes nightly retraining practical. GPU slot calculation and two-wave execution were the specific mechanisms that made that possible on one machine.
7. Technologies Used
- PyTorch — LSTM model implementation, AMP mixed precision
- NVIDIA DGX / Blackwell GPU — parallel training, 90% utilization
- ThreadPoolExecutor — concurrent training/inference orchestration
- nvidia-smi — VRAM introspection for dynamic slot calculation
- PostgreSQL 15 — model registry, snapshot storage, fillback audit
- ONNX — model export for portability
- Docker — isolated training container with --gpus all
- pandas / numpy — feature engineering
8. Relevant For
- ML infrastructure / MLOps roles — model lifecycle management, promotion gates, audit trails
- GPU infrastructure roles — single-machine parallelism without a distributed cluster
- Financial ML roles — time-series prediction, per-window normalization, production inference pipelines
- Platform / Staff engineering roles — the orchestrator and audit design are transferable patterns
Story 3 — Sentiment Engine (Financial NLP Research Program)
1. One-Line Version
Ran a four-phase NLP research program — benchmarking three financial language models, scoring 15 years of news headlines, building a recency-decayed sentiment index, and running Spearman IC analysis with multiple-testing correction — to identify which sentiment signals actually predict 5-day equity returns.
2. The Situation
Financial NLP is full of noise. There are dozens of pre-trained models fine-tuned on financial text, each claiming strong results on benchmark datasets that don't necessarily translate to actual return prediction. The practical question isn't "which model scores highest on PhraseBank?" — it's "which model's output, aggregated across recent news, correlates with what a stock does over the next five trading days?"
Those are very different questions. Answering the second one requires a full research pipeline: ingest real news, run all the models, aggregate the scores in a production-like way, join to actual market returns, and apply appropriate statistical tests with multiple-testing correction to avoid reporting false positives.
3. What I Did (Technical)
Phase 1 — Benchmark: Scored all three models (FinBERT, FinancialBERT, DistilRoBERTa) against PhraseBank on accuracy, expected calibration error (ECE), and throughput. DistilRoBERTa: 99.7% accuracy, ECE=0.003, 5,114 articles/second. FinancialBERT: 98.9%, ECE=0.008, 1,285 articles/second. FinBERT: 97.2%, ECE=0.057, 1,113 articles/second. DistilRoBERTa is the fastest and best calibrated of the three, which matters because poorly calibrated probability scores can't be trusted for position sizing.
Phase 2 — score_flow: Aggregated per-article scores into newsflow_features using a recency-decayed, source-weighted index: score = Σ(score_i × source_weight_i × exp(-0.1 × age_hours_i)). Source weights: IBKR=1.0, Finnhub=0.75, GDELT=0.50. Added a sparse-sample guard: z-scores are set to NULL when article count < 3, preventing spurious signals from single-article windows.
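The score_flow formula and the sparse-sample guard translate directly into code. The sketch below uses the weights, decay rate, and article threshold quoted above; the function names, the tuple layout, and the extra zero-spread check in the guard are illustrative assumptions.

```python
import math

SOURCE_WEIGHTS = {"IBKR": 1.0, "Finnhub": 0.75, "GDELT": 0.50}
DECAY_PER_HOUR = 0.1
MIN_ARTICLES = 3

def score_flow(articles):
    """articles: iterable of (score, source, age_hours) tuples.

    Returns the sum of score * source_weight * exp(-0.1 * age_hours).
    """
    return sum(
        score * SOURCE_WEIGHTS[source] * math.exp(-DECAY_PER_HOUR * age_hours)
        for score, source, age_hours in articles
    )

def guarded_zscore(value, history_mean, history_std, article_count):
    """Return None (NULL in the database) when the window is too sparse.

    The zero-spread check is an added assumption to avoid dividing by zero.
    """
    if article_count < MIN_ARTICLES or history_std == 0:
        return None
    return (value - history_mean) / history_std

window = [(0.8, "IBKR", 1.0), (-0.2, "GDELT", 12.0), (0.5, "Finnhub", 30.0)]
print(round(score_flow(window), 4))
```

With a decay rate of 0.1 per hour, an article loses half its weight after ln(2) / 0.1, or about 6.9 hours, so within a long window the index is dominated by the most recent day of news.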
Phase 3 — IC analysis: Aligned newsflow_features with market_data_daily, computed 1d/5d/20d forward returns, ran Spearman IC across all model × window combinations. Applied Bonferroni (conservative) and Benjamini-Hochberg (power-preserving) multiple-testing correction to avoid reporting false positives from testing 30+ combinations.
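The project lists scipy / statsmodels for this step. The sketch below reimplements the rank correlation and the two corrections in plain Python only to make the mechanics visible; it assumes non-constant inputs and is not the analysis code itself.

```python
def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_ic(signal, forward_returns):
    """Pearson correlation of the ranks. Assumes neither input is constant."""
    rs, rr = ranks(signal), ranks(forward_returns)
    n = len(rs)
    ms, mr = sum(rs) / n, sum(rr) / n
    cov = sum((a - ms) * (b - mr) for a, b in zip(rs, rr))
    var_s = sum((a - ms) ** 2 for a in rs)
    var_r = sum((b - mr) ** 2 for b in rr)
    return cov / (var_s * var_r) ** 0.5

def bonferroni_reject(p_values, alpha=0.05):
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up BH: find the largest rank k with p_(k) <= k/m * alpha, reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff_rank
    return reject

p = [0.01, 0.02, 0.03, 0.04]
print(bonferroni_reject(p))          # [True, False, False, False]
print(benjamini_hochberg_reject(p))  # [True, True, True, True]
```

The last two lines show why both corrections are reported: Bonferroni controls the chance of any false positive and is stricter, while BH controls the false discovery rate and retains more power.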
Phase 4 — Closeout: Collapsed the intraday feature snapshots to one signal per ticker per day (latest before each entry date), re-ran IC to remove the repeated-row inflation from Phase 3, checked subperiod stability (early vs late half), produced a ranked candidate summary.
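The collapse step can be sketched as a single pass that keeps the latest snapshot per ticker per entry date. The record layout (`ticker`, `asof`, `score`) is hypothetical, and "before each entry date" is read here as "snapshot calendar date strictly earlier than the entry date", which is an assumption about the project's cutoff rule.

```python
import bisect
from datetime import date, datetime

def collapse_to_daily(snapshots, entry_dates):
    """Keep one snapshot per (ticker, entry_date): the latest one taken before that date.

    snapshots:   list of dicts with 'ticker', 'asof' (datetime), 'score'
    entry_dates: sorted list of date objects
    """
    latest = {}
    for snap in snapshots:
        # Index of the first entry date strictly after the snapshot's calendar date.
        idx = bisect.bisect_right(entry_dates, snap["asof"].date())
        if idx == len(entry_dates):
            continue  # no later entry date to attach this snapshot to
        key = (snap["ticker"], entry_dates[idx])
        if key not in latest or snap["asof"] > latest[key]["asof"]:
            latest[key] = snap
    return latest

entry_dates = [date(2024, 1, 2), date(2024, 1, 3)]
snapshots = [
    {"ticker": "NVDA", "asof": datetime(2024, 1, 1, 9, 0), "score": 0.1},
    {"ticker": "NVDA", "asof": datetime(2024, 1, 1, 15, 0), "score": 0.3},
    {"ticker": "NVDA", "asof": datetime(2024, 1, 2, 10, 0), "score": -0.2},
    {"ticker": "NVDA", "asof": datetime(2024, 1, 3, 10, 0), "score": 0.9},
]
daily = collapse_to_daily(snapshots, entry_dates)
print(len(daily))  # 2
```

Four intraday rows become two daily observations, which is the kind of reduction that removes repeated-row inflation from the sample size behind a p-value.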
4. What I Did (Leadership)
Phase 4 existed because Phase 3 looked too good. The 168h score_flow ICs were strong: IC=0.180 for distilroberta, IC=0.174 for finbert, p-values at 3.7e-35. The concern was that many intraday feature snapshots could map to the same next trading day, inflating the effective sample size and making p-values stronger than the production setup warranted.
The decision to run Phase 4 — to deliberately revalidate results under stricter conditions — was a choice to slow down rather than ship. That's uncomfortable. The results looked good. There was pressure to move to production signal integration. But "passes a lenient test" and "survives a stricter one" are different claims, and portfolio decisions depend on which one is true.
Phase 4 ran. The IC held up on the collapsed daily dataset. The 168h window remained dominant. The early/late stability check showed the signal retained positive IC in both halves. That's what earned the production recommendation.
5. The Outcome (with numbers)
- Three financial NLP models benchmarked with actual throughput and calibration metrics on 15 years of headlines
- DistilRoBERTa selected as primary: 99.7% accuracy, ECE=0.003, about 4.6× the throughput of FinBERT (5,114 vs 1,113 articles/second)
- IC=0.180 for distilroberta 168h score_flow (5-day forward return, p=3.7e-35, both Bonferroni and BH corrections pass)
- 168h window dominates — short windows (1h, 4h, 24h) show negative or near-zero IC consistently
- Subperiod stability confirmed — positive IC in both early and late halves of the sample
- Production recommendation: distilroberta_168h score_flow as primary, finbert_168h as challenger
- Full research trail from benchmark CSV to IC results to Phase 4 closeout notebook, all reproducible
6. What This Shows About How I Work
I treat research phases as checkpoints where you can either validate or discard what came before. Phase 4 was a deliberate validation pass on Phase 3 results — not a new research direction, but a stricter retest of the same claim. That distinction matters: a lot of ML research adds more analysis when results look good, rather than harder tests. Adding harder tests is the right move, especially when the downstream use involves real capital allocation.
The multiple-testing correction was the other signal of rigor. Testing 30+ model × window combinations without correction makes false positives by chance very likely: at α = 0.05, 30 independent tests have roughly a 79% chance of producing at least one. Applying both Bonferroni (strict) and BH (power-preserving) and requiring both to pass is how you build a result you can defend.
7. Technologies Used
- Hugging Face Transformers — FinBERT, FinancialBERT, DistilRoBERTa scoring pipelines
- NVIDIA DGX / Blackwell GPU — batch article scoring at 5,114 articles/second (DistilRoBERTa)
- PostgreSQL 15 — article_sentiments, newsflow_features, model_registry, model_benchmarks
- scipy / statsmodels — Spearman IC, Bonferroni correction, Benjamini-Hochberg multipletests
- pandas / numpy — feature aggregation, forward return computation, subperiod splitting
- Jupyter — four-phase research notebooks (nb_01 through nb_05)
- Docker — isolated sentimentengine container
8. Relevant For
- NLP / applied ML research roles — full pipeline from model benchmark to production candidate selection
- Quantitative research roles — IC analysis, multiple-testing correction, signal validation methodology
- ML platform roles — multi-model scoring infrastructure, model registry lifecycle
- Any role where "we ran the harder test and it held up" is a valued answer
Story 4 — AutoTransformer (Bounded Autonomous Architecture Search)
1. One-Line Version
Built an autonomous transformer architecture search system that runs on a DGX GPU, uses a priority-ordered decision controller (not a neural architecture search) to propose config changes between iterations, enforces geometric consistency constraints in the loss function, and stops itself when validation loss stabilizes — then documented honestly what it can and cannot conclude.
2. The Situation
Training a transformer for multi-horizon equity price forecasting involves a large configuration space: depth, width, attention heads, learning rate, sequence length. The standard approaches are either manual (pick a config and train) or NAS-based (expensive, requires a search budget that assumes many GPU-hours). Neither fit the constraint: one machine, limited compute, a research question rather than a production deployment.
The question was: can you build a bounded search system that explores architecture configs in a principled way, stops when it has enough information, and produces findings that are honest about what stage of research they represent?
3. What I Did (Technical)
Model: TransformerBlock with MultiheadAttention, LayerNorm, GELU FFN. Full model: input_projection → ticker_embedding → positional_embedding → N transformer blocks → final_norm → prediction head. Global multi-ticker model — one model trained across the full universe rather than per-ticker LSTMs. 252-day sequence length (one trading year of context). Predicts close/high/low returns at 1d, 5d, 20d horizons (9 targets total).
Loss function: Composite Huber loss + 0.10 × range_consistency_penalty. The penalty enforces geometric constraints: relu(low_pred - close_pred) + relu(close_pred - high_pred) + relu(low_pred - high_pred) per horizon. This encodes domain knowledge — predicted low cannot exceed predicted close, predicted close cannot exceed predicted high — as a differentiable constraint. Without it, models frequently produce geometrically invalid forecasts that look fine on MAE but can't be used for range-based position sizing.
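The penalty is easiest to see on scalars. The project implements it in PyTorch; the version below is a plain-Python sketch of the same arithmetic for a single sample. The Huber `delta` of 1.0 and the choice to average Huber terms and penalties are assumptions, since the source gives only the 0.10 weight and the three relu terms.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def range_consistency_penalty(low_pred, close_pred, high_pred):
    """Zero when low <= close <= high; grows linearly with each violation."""
    return (
        relu(low_pred - close_pred)
        + relu(close_pred - high_pred)
        + relu(low_pred - high_pred)
    )

def huber(error, delta=1.0):
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def composite_loss(preds, targets, penalty_weight=0.10, delta=1.0):
    """preds / targets: dict mapping horizon -> (low, close, high) returns.

    Mean Huber over all targets plus penalty_weight times the mean penalty
    over horizons. The averaging scheme is an illustrative assumption.
    """
    huber_terms, penalties = [], []
    for horizon, (low, close, high) in preds.items():
        t_low, t_close, t_high = targets[horizon]
        huber_terms += [
            huber(low - t_low, delta),
            huber(close - t_close, delta),
            huber(high - t_high, delta),
        ]
        penalties.append(range_consistency_penalty(low, close, high))
    return sum(huber_terms) / len(huber_terms) + penalty_weight * sum(penalties) / len(penalties)

print(range_consistency_penalty(-0.02, 0.0, 0.02))  # 0.0 for a valid ordering
```

Because each term is a relu of a difference, the penalty has a usable gradient exactly when an ordering is violated and contributes nothing otherwise.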
Controller: Priority-ordered decision tree, not NAS. Priority order: (1) fix range violations if penalty > threshold, (2) add depth if loss is still improving, (3) widen model if depth is maxed, (4) reduce learning rate if loss is oscillating. should_stop() triggers after stabilization_rounds consecutive iterations below improvement_threshold. Writes iteration summaries to PostgreSQL and can call a local LLM to generate a natural-language interpretation of each review.
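A priority-ordered controller with a self-stopping rule can be sketched as two small functions. This is one reading of the rules above, not the project's controller: the state keys, the action names, `range_penalty_threshold`, `max_depth`, and the use of absolute (rather than relative) improvement are all assumptions. The stopping defaults match the configuration values quoted in the outcome list.

```python
def next_action(state, range_penalty_threshold=0.01, max_depth=8):
    """Return the first rule that fires, in priority order.

    state keys (hypothetical): range_penalty, loss_improving, loss_oscillating, depth
    """
    if state["range_penalty"] > range_penalty_threshold:
        return "fix_range_violations"
    if state["loss_improving"]:
        # Depth first; widen only once depth is maxed out.
        return "add_depth" if state["depth"] < max_depth else "widen_model"
    if state["loss_oscillating"]:
        return "reduce_learning_rate"
    return "no_change"

def should_stop(val_losses, stabilization_rounds=2, improvement_threshold=0.0025, max_iterations=8):
    """Stop at the iteration cap, or once the last stabilization_rounds
    improvements in validation loss are all below the threshold."""
    if len(val_losses) >= max_iterations:
        return True
    if len(val_losses) <= stabilization_rounds:
        return False
    recent = val_losses[-(stabilization_rounds + 1):]
    improvements = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(imp < improvement_threshold for imp in improvements)

print(should_stop([1.0, 0.9, 0.8]))              # False: still improving
print(should_stop([1.0, 0.5, 0.499, 0.4985]))    # True: two small improvements in a row
```

Both functions are pure, so every decision can be replayed later from the logged iteration summaries.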
Features: 15 engineered features per ticker per day (log returns at 1/5/20d, gap, intraday range, volume z-score, MA ratios, ATR ratio, etc.). Optional newsflow and GDELT merge paths for multi-modal extension.
4. What I Did (Leadership)
The NEXT_STEPS.md file for AutoTransformer is the clearest expression of what I mean by honest engineering documentation. It says explicitly: "current scaffold cannot conclude whether the architecture is predictive — it needs a purged time split, naive baselines, and horizon-by-horizon evaluation before any production claim."
Writing that document was a deliberate choice to scope what the AutoTransformer is right now (a bounded search scaffold with correct geometric constraints and a working controller) versus what it would need to be before anyone should rely on it for trading decisions. The gap between those two things is real, and documenting it protects the people who will work on this next — whether that's me in three months or someone else entirely.
The range consistency penalty was a different kind of leadership decision: it was technically optional (the model could train without it and still minimize MAE), but including it from the start means the outputs are geometrically valid by construction. Making that a default, not a flag, reflects a preference for designs that make the wrong thing hard to do rather than merely documenting it as wrong.
5. The Outcome (with numbers)
- Global multi-ticker transformer covering the full NDX 100 universe from a single model
- Range consistency penalty enforcing geometric validity (low ≤ close ≤ high) across all three predicted horizons
- Bounded controller with priority-ordered config recommendations and self-stopping criteria
- PostgreSQL logging of all iteration summaries, config changes, and written findings — visible in the Dev Lab dashboard tab
- max_iterations=8 by default, with configurable stabilization_rounds=2 and improvement_threshold=0.0025
- Optional newsflow + GDELT multi-modal merge already wired in the feature pipeline
- Honest scope documentation in NEXT_STEPS.md: what can be concluded vs. what requires a purged time split and baseline comparisons
6. What This Shows About How I Work
I build things that stop themselves. The bounded controller is a practical solution to a resource constraint (one machine, finite compute), but the design principle — define your stopping criteria before you start running — is more broadly applicable. Open-ended training runs that you kill manually are a symptom of not having thought through what "done" means. The AutoTransformer's controller knows what "done" looks like before the first iteration starts.
The NEXT_STEPS document is the other half of the same principle. Knowing what you've built and being clear about what it can't yet prove is how you keep research honest as it moves toward production. The range constraint, the controller, the PostgreSQL logging — those are all durable. The production claim waits for the purged time split.
7. Technologies Used
- PyTorch — TransformerBlock, MultiheadAttention, composite loss with geometric penalty
- NVIDIA DGX / Blackwell GPU — GPU training with CUDA acceleration
- PostgreSQL 15 — iteration logging, findings storage, controller review history
- pandas / numpy — feature engineering (15 features × full universe × 252-day sequences)
- Ollama / local LLM — natural-language iteration review generation
- Docker — isolated training container
- Python dataclasses — AutoTransformerConfig for clean config management
8. Relevant For
- ML research engineering roles — architecture search, custom loss functions, bounded autonomous systems
- AI infrastructure roles — GPU training orchestration, self-stopping controllers, structured experiment logging
- Technical leadership roles — documentation discipline, honest scope definition, research-to-production transition planning
- Any role where "we built it but we're not done validating it" is the right thing to say
Story 5 — The Board (Unified Research Dashboard)
1. One-Line Version
Built an always-on, multi-tab research dashboard that surfaces LSTM predictions, RF signals, NLP sentiment, transformer experiments, and a local LLM query interface across a 103-ticker universe — turning a collection of GPU-trained models into something a human can actually use.
2. The Situation
Every sub-system was working. The LDTM was running inferences nightly across 103 tickers. The TQQQ signal was generating BUY/SHORT/HOLD decisions with Monte Carlo confidence intervals. The Sentiment Engine had built scored newsflow features for the full universe. The AutoTransformer was logging experimental run findings to PostgreSQL.
But all of it lived in database tables. Asking "what does the system think about NVDA today?" meant running SQL queries against five different schemas, joining prediction history to accuracy metrics to newsflow features, and reading raw numbers. There was no integrated view, no way to ask a plain-English question, no way to see whether the models were degrading without writing a diagnostic query. The research surface didn't exist.
The question wasn't whether to build a dashboard. It was whether to build one that would survive contact with a real research workflow — or one that would accumulate tabs, become unmaintainable, and get ignored.
3. What I Did (Technical)
Built a Streamlit multi-tab dashboard deployed as an always-running Docker container (trading-dashboard-test) on the DGX, backed directly by PostgreSQL with no intermediate API layer.
Seven tabs, each serving a distinct research function:
- Ticker View — LDTM prediction history for individual tickers, with actual close fillback so every prediction shows whether direction was correct
- Today's Predictions — All 103 tickers ranked by implied 1-month return from the latest LDTM inference run
- Accuracy Leaderboard — 30-day rolling direction accuracy and MAE per ticker, sourced from ldtm_accuracy_30d, surfacing which models are drifting
- TQQQ Signal — RF directional call (BUY/SHORT/HOLD) with probability scores and Monte Carlo summary (median path, p5/p95 range, probability of profit)
- LLM Query — Conversational interface over the full prediction, accuracy, and headlines dataset, powered by local Mistral-7B FP8 (39GB, served via Triton on port 8000)
- Newsflow — Sentiment Engine Phase 2 features in a list/detail workflow with ticker search, per-ticker metrics, model comparison across FinBERT/FinancialBERT/DistilRoBERTa, recent feature windows, and recent scored headlines. Qwen local LLM generates cached plain-English summaries per ticker so you don't have to read IC values to understand sentiment direction
- Dev Lab — AI model evolution tracking: project-level model catalog, sentiment model_registry, model_benchmarks with clickable benchmark-run picker, article_sentiments scoring health, AutoTransformer run history with iteration comparisons and written findings
Added a custom CSS layer for table interaction quality — dataframe checkbox visibility, hover state, selected-row highlight — and Claude-orange heading accents introduced in Dev Lab as a low-risk prototype before rolling the visual pattern across the rest of the dashboard.
All data reads hit PostgreSQL directly. No caching middleware, no REST layer, no transformation pipeline between the models and the display.
4. What I Did (Leadership)
The hardest decision was scope discipline. Every sub-system could justify its own dedicated UI. The Sentiment Engine research alone had enough moving parts to become a standalone analytical tool. The AutoTransformer had run comparison views that could have turned into a full experiment tracking platform.
The rule I set — and enforced consistently — was: the dashboard surfaces what the models produce; it does not become a product. Every tab should be readable Python. No clever abstractions. No intermediate API layer that creates a maintenance burden the next time a schema changes. If a new tab needs more than a few hundred lines and a direct SQL query, it's a signal that it belongs in its own tool, not bolted onto this one.
Applied that rule to the newsflow detail view, which could easily have become a complex frontend component. Instead: one back button, one ticker search, one cached LLM summary, data from two SQL queries. Done.
The Dev Lab was introduced deliberately as a prototype tab — testing the visual patterns and table interaction styles at low risk before promoting them across the more heavily-used tabs. That sequencing avoided a situation where a CSS change broke the production signal views while testing what headers should look like.
5. The Outcome (with numbers)
- 7 tabs surfacing all major sub-systems in one place, from model predictions to NLP research to LLM query
- 103 tickers covered in the prediction and newsflow views, with per-ticker drill-down in both
- Always-on — container runs continuously; rebuild is a single docker compose command
- 2 local LLMs integrated — Mistral-7B FP8 (39GB) for the query tab, Qwen for newsflow summaries — both running on the DGX, zero API calls to external services
- Zero intermediate API layer — dashboard reads PostgreSQL directly, which means schema changes in the models propagate to the UI without a separate service update
- Sub-2-second inference display for LDTM across all 103 tickers (from the 142ms/ticker inference benchmark)
- Within one week of going live, the Accuracy Leaderboard surfaced three tickers with direction accuracy below 48% over the previous 30 days; before the tab existed, finding them would have required writing a diagnostic query
6. What This Shows About How I Work
I build tools that close the loop. Every model in this system produces output into a database table. Left there, those outputs only help someone who knows exactly which table to query and what the columns mean. The Board exists because a research system that can't be interrogated by a human isn't really a research system — it's a collection of scripts that run on a schedule and produce numbers no one reads.
The discipline applied here — direct PostgreSQL reads, no intermediate layers, tabs that stay small, LLMs running locally so there are no API dependencies — reflects a preference for systems that are durable over systems that are impressive. When something breaks, there's one place to look. When a schema changes, you update one SQL query. When a new sub-system ships, you add one tab.
7. Technologies Used
- Streamlit — multi-tab dashboard framework
- PostgreSQL 15 — direct data source for all tabs (no intermediate API)
- Docker / Docker Compose — container deployment with named profile (dashboard)
- Mistral-7B FP8 — local LLM for conversational query tab, served via NVIDIA Triton Inference Server
- Qwen (Ollama) — local LLM for cached newsflow plain-English summaries
- NVIDIA Triton Inference Server — model serving for Mistral FP8 on port 8000
- pandas — data transformation for display
- Custom CSS — table interaction styling, orange heading accents
- Azure Blob Storage — off-host snapshot export for persistence
- Prometheus + Grafana — pipeline monitoring visible alongside dashboard
8. Relevant For
- Platform / ML Infrastructure roles — built the human-facing layer that ties together GPU training, NLP pipelines, and time-series models into one operable system
- AI Product / Research Engineering — integrated two local LLMs (one conversational, one summarization) with research data in a way that requires no external API budget
- Technical Leadership — set and enforced scope boundaries across seven sub-systems without letting the dashboard grow into a product nobody maintains
- Startup / resource-constrained environments — the entire stack (inference, storage, dashboard, LLMs) runs on one machine, on-premises, with no cloud spend on the critical path