Project: Sentiment Engine — Financial NLP Research Program

One-Line Version

Designed and executed a five-phase financial NLP research program (Phase 0 through Phase 4: data quality, model benchmarking, feature engineering, statistical signal validation, and closeout) that identified a sentiment-derived trading signal with statistically significant predictive power at the 5-day return horizon, surviving both Bonferroni and Benjamini-Hochberg multiple-testing corrections. The methodology was designed so the result, positive or negative, could be defended under scrutiny.

The Situation

The trading system already had price-based models on 103 tickers. The question was whether financial news sentiment could add independent signal — not as a narrative overlay, but as a measurable predictor of forward returns that survives proper statistical testing.

The standard way to answer that question badly is to pick a sentiment model, score some headlines, plot against price, and declare victory. The right way is to treat it as a research program: establish data quality before scoring, benchmark models before deploying them, build a feature store that earns the right to be tested, and apply statistical corrections that prevent self-deception when testing dozens of model/window combinations simultaneously.

Three data sources were already in the database: IB/Dow Jones news (highest credibility), Finnhub, and GDELT macro event data. The challenge was turning that raw corpus into something testable — with enough rigor that the result of the test, whichever direction it went, could be trusted.

What I Built — Technical

Phase 0 — Data Quality Before Scoring

Before scoring a single article, ran a data exploration pass to understand what was actually in the corpus. IB/Dow Jones articles carry metadata prefixes (DJ-) and HTML disclaimer boilerplate in article bodies — noise that would corrupt tokenization and confuse any transformer model. Wrote an ibkr_cleaner that strips these in-place on existing rows and patches the write path so new ingestion arrives clean.
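
A minimal sketch of what such a cleaning step might look like. The function names, the exact prefix pattern, and the disclaimer marker string are all hypothetical; the real ibkr_cleaner patterns are not shown in this write-up.

```python
import re
from html import unescape

# Assumption: the metadata prefix is a literal leading "DJ-" tag.
PREFIX_RE = re.compile(r"^\s*DJ-\s*")
# Naive tag stripper, adequate for a sketch but not for arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")


def clean_headline(text: str) -> str:
    """Remove a leading DJ- prefix and surrounding whitespace."""
    return PREFIX_RE.sub("", text).strip()


def clean_body(html: str, disclaimer_marker: str = "DISCLAIMER:") -> str:
    """Strip tags, cut everything from the (hypothetical) disclaimer marker
    onward, and collapse runs of whitespace."""
    text = unescape(TAG_RE.sub(" ", html))
    cut = text.find(disclaimer_marker)
    if cut != -1:
        text = text[:cut]
    return " ".join(text.split())
```

Applying the same functions both to existing rows and inside the ingestion write path is what keeps old and new data consistent.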

Profiled token length distributions across sources. 95th-percentile headline length: 30–58 tokens — well within transformer context windows. Body text was more variable and deferred to a later phase. This confirmed that headline-level scoring was the right first target, rather than discovering it after running inference on 60,000 articles.

The lesson from Phase 0: data quality is not a preprocessing step, it is the prerequisite that determines whether the research is worth running.

Phase 1 — Model Benchmarking Before Production Use

Benchmarked three financial-domain NLP models on the Financial PhraseBank all-agree split — the gold-standard subset where all human annotators agreed on the label, minimizing label noise:

Model	Accuracy	F1 (macro)	ECE	Throughput (GPU, batch 32)
DistilRoBERTa	99.7%	99.6%	0.003	5,114 articles/sec
FinancialBERT	98.9%	98.6%	0.008	1,285 articles/sec
FinBERT	97.2%	96.3%	0.057	1,113 articles/sec

DistilRoBERTa is the clear winner on accuracy, calibration error, and throughput: roughly 4.6× faster than FinBERT at batch 32 on the DGX GPU (5,114 vs 1,113 articles/sec). FinBERT's ECE of 0.057 means its confidence scores are meaningfully less reliable, which matters when confidence is used to filter or weight article scores. Benchmark results are stored in model_benchmarks alongside all metadata (device, batch size, split name, run ID), so the comparison is reproducible and queryable from the dashboard.
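
For readers unfamiliar with ECE (expected calibration error), the sketch below shows one common way to compute it: bin predictions by confidence and take the sample-weighted average gap between accuracy and mean confidence in each bin. The equal-width binning and the default of 10 bins are assumptions; the write-up does not state which binning the benchmark used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: top-class probabilities in [0, 1].
    correct: booleans, True where the predicted label matched the gold label."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated model scores 0; a model that is 95% confident but only 90% accurate in a bin contributes a gap of 0.05 for that bin.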

After benchmarking, scored 10,000 random headlines from each of the two headline sources (IB/Dow Jones and Finnhub) with all 3 models: 10,000 × 2 sources × 3 models = 60,000 scoring events written to article_sentiments. Score formula: score = P_positive − P_negative ∈ [−1, +1], with a ±0.05 neutral band. Continuous scores are stored alongside class probabilities so downstream research can apply different thresholds without rescoring.
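
The score formula and neutral band can be sketched in a few lines. The function names are illustrative, and treating the band edges as neutral (inclusive) is an assumption.

```python
def sentiment_score(p_positive: float, p_negative: float) -> float:
    """Continuous score in [-1, +1] from the class probabilities."""
    return p_positive - p_negative


def label_from_score(score: float, neutral_band: float = 0.05) -> str:
    """Map a continuous score to a label; scores inside the band are neutral."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"
```

Because the continuous score is what gets stored, a later study can call label_from_score with a different neutral_band without rerunning inference.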

Phase 2 — Newsflow Feature Store

Turned article-level scores into ticker-level rolling features across five time windows: 1h, 4h, 24h, 72h, and 168h.

The core feature is score_flow — not a simple mean, but a recency-decayed, source-weighted sentiment index:

score_flow = Σ (decay × source_weight × score) / Σ (decay × source_weight)

decay = exp(−0.1 × age_hours)

source weights: IBKR = 1.0 · Finnhub = 0.75 · GDELT = 0.50

Source credibility weights reflect the editorial filtering that professional wire services apply versus aggregated web sources versus macro event counts. Recency decay at λ=0.1 per hour means a 7-hour-old article retains ~50% weight — fast enough to respond to news flow, slow enough not to collapse on sparse days.
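
A minimal sketch of the score_flow calculation using the weights and decay rate given above. The tuple layout, the lowercase source keys, and returning None for an empty window are assumptions made for illustration.

```python
import math

SOURCE_WEIGHTS = {"ibkr": 1.0, "finnhub": 0.75, "gdelt": 0.50}
DECAY_RATE = 0.1  # per hour


def score_flow(articles):
    """articles: iterable of (age_hours, source, score) tuples for one
    ticker and window. Returns the decay- and source-weighted mean score."""
    numerator = 0.0
    denominator = 0.0
    for age_hours, source, score in articles:
        weight = math.exp(-DECAY_RATE * age_hours) * SOURCE_WEIGHTS[source]
        numerator += weight * score
        denominator += weight
    return numerator / denominator if denominator > 0 else None


# Half-life check: ln(2) / 0.1 is about 6.93 hours, so a 7-hour-old
# article carries just under half the weight of a fresh one.
weight_at_7h = math.exp(-DECAY_RATE * 7)
```

Dividing by the sum of weights keeps score_flow in [−1, +1] regardless of how many articles fall in the window.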

Additional features per window: score_raw (unweighted mean), positive_ratio, negative_ratio, neutral_ratio, velocity (score_flow delta vs prior window), source_disagreement (cross-source standard deviation, i.e. how much IB and Finnhub disagree), and score_zscore and article_count_zscore against 30-day rolling baselines.

Critical design decision: z-scores are set to NULL when article_count < 3. A z-score computed on one or two articles looks precise but is statistically meaningless. A NULL is safer for downstream ranking and dashboard display than a misleading value.
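
The sparse-sample guard might look like the sketch below. The function name is hypothetical, and the two extra guards (too few baseline points, zero baseline variance) are assumptions added so the example cannot divide by zero; only the article_count < 3 rule comes from the design described above.

```python
import statistics

MIN_ARTICLES = 3


def guarded_zscore(value, baseline, article_count):
    """Return the z-score of `value` against `baseline`, or None (stored as
    NULL) when the window is too sparse to support one."""
    if article_count < MIN_ARTICLES:
        return None
    if len(baseline) < 2:
        return None  # assumption: a baseline needs at least two points
    spread = statistics.stdev(baseline)
    if spread == 0:
        return None  # assumption: a flat baseline gives no usable scale
    return (value - statistics.mean(baseline)) / spread
```

Downstream ranking code then has to handle None explicitly, which is the point: a missing value forces a decision, while a spurious 3σ reading silently wins the sort.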

Phase 3 — IC-Based Signal Validation

This is the phase where the research either earns credibility or doesn't.

Built an aligned research dataset joining newsflow_features to market_data_daily with a 1-trading-day entry lag — signal observed at window close, entry at next open. Computed forward returns at 1-day, 5-day, and 20-day horizons. Applied Spearman rank Information Coefficient rather than Pearson — financial returns are not normally distributed, and rank correlation is more robust to the fat tails that dominate equity return distributions.
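
To make the IC calculation concrete, here is a standard-library sketch of Spearman rank correlation: rank both series (ties share their average rank), then take the Pearson correlation of the ranks. The project itself lists SciPy's spearmanr for this, which also returns a p-value; this hand-rolled version is only meant to show what the number measures.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def spearman_ic(signal, forward_return):
    """Rank IC between a signal and the forward return it is meant to predict."""
    return pearson(average_ranks(signal), average_ranks(forward_return))
```

Because only ranks enter the calculation, a single extreme return moves the IC no more than any other observation, which is the robustness property that motivates Spearman over Pearson here.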

Tested 90 combinations: 3 models × 5 windows × 2 signal features × 3 forward return horizons. With 90 simultaneous tests, raw p-values are nearly meaningless — the expected number of false positives at α=0.05 without correction is 4.5. Applied both Bonferroni correction (most conservative — controls family-wise error rate) and Benjamini-Hochberg FDR correction (controls expected false discovery rate). Only combinations surviving both corrections are treated as credible.
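
The two corrections can be sketched without any libraries. The project lists statsmodels' multipletests for this; the functions below are a hand-written illustration of the same decision rules, and their names are not from the codebase.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject only where p <= alpha / m (controls family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha,
    then reject the k smallest p-values (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_k:
            reject[idx] = True
    return reject


def survives_both(p_values, alpha=0.05):
    """Credibility rule from the text: a test counts only if both agree."""
    bonf = bonferroni_reject(p_values, alpha)
    bh = benjamini_hochberg_reject(p_values, alpha)
    return [a and b for a, b in zip(bonf, bh)]
```

With 90 tests at α=0.05 the Bonferroni threshold is 0.05 / 90 ≈ 5.6×10⁻⁴. Anything that clears Bonferroni also clears BH, so requiring both is effectively the stricter rule, with BH reported alongside for comparison.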

What the IC analysis found: The 168-hour (7-day) window dominates across all models and horizons. Short windows (1h, 4h) show near-zero or negative IC, which is consistent with the immediate market reaction to news already being priced in by the time of entry. The predictive content lives in the medium-term sentiment trend, not the individual article.

Top results at the 5-day forward return horizon:

Signal	IC	p-value	Bonferroni	BH
distilroberta_168h score_flow	0.180	3.7×10⁻³⁵	✓	✓
finbert_168h score_flow	0.174	5.1×10⁻³³	✓	✓
financialbert_168h score_flow	0.100	8.8×10⁻¹²	✓	✓
finbert_72h score_flow	0.041	3.3×10⁻⁵	✓	✓
distilroberta_72h score_flow	0.035	3.6×10⁻⁴	✓	✓

The counterintuitive finding: benchmark rank does not predict IC rank. FinBERT, the weakest model on PhraseBank (97.2% accuracy, ECE 0.057), produces a much stronger IC (0.174) than FinancialBERT (0.100), even though FinancialBERT scores higher on the benchmark (98.9% accuracy). DistilRoBERTa happens to lead on both measures. The benchmark measures label agreement with human annotators on financial text. The IC measures correlation with actual future price movement. These are different things. A model that looks better on a proxy benchmark is not guaranteed to perform better on the target task, which here is predicting where prices go. This finding is documented explicitly in the architecture file as a reminder that benchmark results don't transfer automatically.

Built supplementary analyses: a 15×15 signal correlation matrix (to identify redundant model/window combinations), quintile bucket return tables (observations sorted into five buckets by signal score, average forward return per bucket), and extreme z-score event studies (performance where the sentiment z-score is beyond ±2σ).
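
A bucket return table of the kind mentioned above can be sketched as follows. Equal-count buckets formed by sorting on the signal are an assumption, as is the function name.

```python
def bucket_returns(signal, forward_return, n_buckets=5):
    """Sort observations by signal, split into n_buckets equal-count groups
    (lowest signal first), and return the mean forward return per group."""
    pairs = sorted(zip(signal, forward_return), key=lambda pair: pair[0])
    n = len(pairs)
    means = []
    for b in range(n_buckets):
        lo = b * n // n_buckets
        hi = (b + 1) * n // n_buckets
        members = pairs[lo:hi]
        if members:
            means.append(sum(ret for _, ret in members) / len(members))
        else:
            means.append(None)  # fewer observations than buckets
    return means
```

A useful signal should show bucket means that rise roughly monotonically from the lowest-signal bucket to the highest; a high IC with scrambled buckets is a warning sign.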

Phase 4 — Closeout and Production Candidate Selection

Phase 3 used all available intraday feature snapshots, which could map multiple observations to the same forward return — inflating effective sample size and making p-values look stronger than the actual production setup warrants.

Phase 4 collapsed to one signal per ticker per trading day (keeping the latest available snapshot before each entry date), reran the full IC analysis on the collapsed dataset, and split the sample into early and late halves to test regime stability. Success criteria: the collapsed dataset shows a clear leader, 168h remains stronger than short windows, the top candidate retains positive IC in both subperiods, bucket ordering remains directionally sensible.
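
The collapse and the subperiod split can be sketched as below. The field names (ticker, entry_date, as_of, signal) and the example tickers are hypothetical, and the sketch assumes each snapshot has already been matched to the entry date it precedes.

```python
from datetime import datetime


def collapse_daily(snapshots):
    """Keep only the latest snapshot per (ticker, entry_date)."""
    latest = {}
    for snap in snapshots:
        key = (snap["ticker"], snap["entry_date"])
        if key not in latest or snap["as_of"] > latest[key]["as_of"]:
            latest[key] = snap
    return sorted(latest.values(), key=lambda s: (s["entry_date"], s["ticker"]))


def split_halves(rows):
    """Split rows into early and late subperiods at the median entry date."""
    dates = sorted({row["entry_date"] for row in rows})
    cutoff = dates[len(dates) // 2]
    early = [row for row in rows if row["entry_date"] < cutoff]
    late = [row for row in rows if row["entry_date"] >= cutoff]
    return early, late


snapshots = [
    {"ticker": "AAA", "entry_date": "2024-01-02", "as_of": datetime(2024, 1, 1, 10), "signal": 0.1},
    {"ticker": "AAA", "entry_date": "2024-01-02", "as_of": datetime(2024, 1, 1, 15), "signal": 0.3},
    {"ticker": "BBB", "entry_date": "2024-01-02", "as_of": datetime(2024, 1, 1, 12), "signal": -0.2},
    {"ticker": "AAA", "entry_date": "2024-01-03", "as_of": datetime(2024, 1, 2, 15), "signal": 0.4},
]
daily = collapse_daily(snapshots)
early, late = split_halves(daily)
```

Rerunning the IC on daily rather than snapshots removes the repeated observations that share one forward return, so the resulting p-values reflect the sample size a production strategy would actually see.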

The closeout phase exists precisely because a research program that inflates its own results before handing off to production is not a research program — it is a pitch. Phase 4 is the step that earns the right to make a production recommendation.

Production recommendation:

Primary: distilroberta_168h score_flow — strongest IC, fastest inference, best calibrated
Challenger: finbert_168h score_flow — second strongest IC, useful for monitoring signal agreement

Schema and Infrastructure

Four-table schema: article_sentiments (per-article scores with full probability distribution, latency, model version, scored text type), newsflow_features (ticker/window aggregates with all computed fields), model_benchmarks (benchmark result cache, queryable from dashboard), model_registry (model metadata including F1, ECE, throughput). A unified view v_articles joins IB and Finnhub sources for consistent article queries. All tables use idempotent unique constraints so rescoring and backfilling are safe to rerun.
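
The idempotency property can be illustrated with an upsert keyed on a unique constraint. This sketch uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs anywhere, and the column names and the (article_id, model_name) key are assumptions; the real constraints are not shown in this write-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE article_sentiments (
        article_id TEXT NOT NULL,
        model_name TEXT NOT NULL,
        score      REAL NOT NULL,
        UNIQUE (article_id, model_name)
    )
    """
)


def upsert_score(conn, article_id, model_name, score):
    """Insert a score, or overwrite the existing row for the same key."""
    conn.execute(
        """
        INSERT INTO article_sentiments (article_id, model_name, score)
        VALUES (?, ?, ?)
        ON CONFLICT (article_id, model_name)
        DO UPDATE SET score = excluded.score
        """,
        (article_id, model_name, score),
    )


# Scoring the same article twice leaves one row holding the latest score.
upsert_score(conn, "a1", "distilroberta", 0.42)
upsert_score(conn, "a1", "distilroberta", 0.45)
row_count = conn.execute("SELECT COUNT(*) FROM article_sentiments").fetchone()[0]
stored = conn.execute("SELECT score FROM article_sentiments").fetchone()[0]
```

PostgreSQL supports the same INSERT ... ON CONFLICT ... DO UPDATE form, so a rerun of a backfill job updates rows in place instead of duplicating them or failing.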

What I Built — Leadership and Judgment

Structured a research program, not a research hack. The phase sequencing was deliberate: data quality before scoring, benchmarking before production use, feature engineering before signal testing, a trust-building closeout before production recommendation. Each phase had documented prerequisites and exit criteria. Skipping straight to IC analysis on uncleaned data with an unbenchmarked model would have produced faster-looking results that couldn't be trusted. The slower path produces results that can.

Applied multiple-testing correction before anyone asked. With 90 combinations tested simultaneously at α=0.05, uncorrected testing would be expected to produce about 4.5 false positives even if no signal were real. Using both Bonferroni and BH corrections was a decision to communicate results that could withstand scrutiny, not just internally but in a future conversation with a portfolio manager or quant reviewer who knows what multiple testing means. The correction was not imposed by a reviewer after the fact.

Documented the counterintuitive findings explicitly. FinBERT outperforming FinancialBERT on IC (0.174 vs 0.100) despite lower PhraseBank accuracy (97.2% vs 98.9%) is a finding, not an implementation detail. It is in the architecture file as a permanent reminder that proxy benchmarks diverge from production performance. The findings that challenge assumptions are the ones that get quietly dropped when research is done informally. They are the ones worth keeping.

Designed Phase 4 as a trust mechanism, not a formality. The closeout phase exists to deflate sample size inflation before a production recommendation is made. That is not a step most individual researchers include, because it typically makes the results look less impressive. It is included here because the goal was a result that could be acted on, not a result that looked good in a presentation.

Results

Metric	Value
Scoring events (Phase 1 sample)	60,000 (10k headlines × 2 sources × 3 models)
Models benchmarked	3 (FinBERT, FinancialBERT, DistilRoBERTa)
Best PhraseBank accuracy	99.7% (DistilRoBERTa)
GPU throughput (DistilRoBERTa, batch 32)	5,114 articles/sec
Time windows tested	5 (1h, 4h, 24h, 72h, 168h)
Signal combinations tested	90
Best IC — 5-day forward return	0.180 (distilroberta_168h score_flow)
Statistical significance	p = 3.7×10⁻³⁵, surviving both Bonferroni and BH correction
Signals surviving both corrections	6 at 5-day horizon · 3 at 1-day · 3 at 20-day
Production candidate selected	distilroberta_168h score_flow
Dashboard integration	Live Newsflow tab — ticker search, flow metrics, LLM plain-English interpretation
Nightly automation	Incremental Finnhub + GDELT refresh, cron at 18:45 weekdays

What This Shows About How I Work

I know the difference between a result that looks good and a result that is good, and I build the infrastructure to tell them apart. Applying multiple-testing correction before presenting IC results, nulling z-scores on sparse windows, and running a separate closeout phase to deflate inflated sample sizes are all decisions that reduce the apparent impressiveness of the numbers in exchange for results that can actually be acted on. That trade-off — between the story you can tell quickly and the one you can defend under scrutiny — is where the character of a research program shows.

I choose defensible.

Technologies

Python 3.11 · Hugging Face Transformers · FinBERT · FinancialBERT · DistilRoBERTa · NVIDIA GB10 Blackwell · CUDA · GPU batch inference · SciPy (spearmanr) · statsmodels (multipletests: Bonferroni + Benjamini-Hochberg) · pandas (merge_asof for forward-return alignment) · NumPy · PostgreSQL 15 (article scores, feature store, benchmark cache, model registry) · SQLAlchemy · JupyterLab (containerised phased research notebooks) · Streamlit · Qwen (local LLM for plain-English newsflow summaries) · Interactive Brokers · Finnhub · GDELT · Docker · Cron

Relevant For

Role	Why This Story Fits
Quantitative Researcher / Quant Engineer	IC methodology, Spearman rank correlation, multiple-testing correction, signal redundancy matrix, subperiod stability testing
ML Engineer (NLP / Applied)	Model benchmarking before deployment, ECE calibration, throughput profiling, source credibility weighting, feature store design
Head of AI / Research Lead	Phased research program with documented prerequisites, exit criteria, and trust-building closeout before production recommendation
Staff / Principal Engineer	Sparse-sample guards, idempotent schema, multi-source pipeline, automated nightly refresh, deflation of inflated sample sizes
Data Scientist (Finance)	End-to-end signal development: raw corpus → cleaning → benchmarking → feature engineering → IC validation → production candidate
Technical Founding Role	Solo-built research program with the statistical discipline of a dedicated quant research team