The Three Data Sources
This project uses three distinct data streams. Each has a different frequency, coverage, and latency.
| Source | Frequency | Coverage | Latency | Provider |
|---|---|---|---|---|
| NASDAQ-100 price/volume | Daily OHLCV | 20 years | End-of-day | Interactive Brokers |
| Options chain | Hourly | ~2 years | Real-time / 15min delay | Interactive Brokers |
| News sentiment | Event-driven → daily aggregated | ~1-2 years | Near real-time | Finnhub, GDELT, IB |
The fundamental alignment problem: these sources have different granularities, different start dates, and different update patterns. Getting them into a single daily DataFrame per ticker is the foundation for everything that follows.
Step 1: Establish the Master Calendar
The first architectural decision is: what is the master time index? Every other data source gets aligned to this.
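For NASDAQ-100 trading, the master index is the set of all US equity trading days, excluding weekends and market holidays. Price data exists only on these days, so it could serve as the index directly; here we generate the calendar from the NYSE schedule (NASDAQ observes the same holidays) so the index does not depend on any single ticker's history.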
import pandas as pd
import pandas_market_calendars as mcal
# Generate master trading calendar
nyse = mcal.get_calendar('NYSE')
schedule = nyse.schedule(start_date='2004-01-01', end_date='2024-12-31')
trading_days = mcal.date_range(schedule, frequency='1D')
# Convert to date index (no time component for daily data)
master_index = pd.DatetimeIndex([d.date() for d in trading_days])
# Result: roughly 252 trading days per year (~5,040 per 20 years;
# the 2004-2024 range above covers 21 calendar years, so slightly more)
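To make "aligned to this" concrete, here is a minimal sketch of reindexing one daily series onto a master index. The dates and the `raw` series are made up for illustration, and only pandas is assumed.

```python
import pandas as pd

# Hypothetical master index: four consecutive trading days
master_index = pd.DatetimeIndex(
    ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]
)

# Hypothetical raw source: includes a Saturday row and is missing 2024-01-04
raw = pd.Series(
    [0.10, 0.20, 0.30, 0.40],
    index=pd.DatetimeIndex(
        ["2024-01-02", "2024-01-03", "2024-01-05", "2024-01-06"]
    ),
    name="signal",
)

# Reindex onto the master calendar:
# - rows on non-trading days are dropped
# - trading days with no observation become NaN (handled explicitly later)
aligned = raw.reindex(master_index)
```

After this step every source shares the same row labels, so a missing observation shows up as an explicit NaN rather than as a silently absent row.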
Step 2: NASDAQ-100 Price Data
Daily OHLCV (Open, High, Low, Close, Volume) for the NASDAQ-100 universe is the most straightforward data stream. For this project, the universe is the current NASDAQ-100 constituents — 103 tickers including their options-active names.
Structuring the DataFrame
The cleanest structure for a multi-ticker time series system is a long-format DataFrame: one row per (date, ticker) pair. This makes cross-ticker operations, filtering, and merging straightforward.
# Long format: (date, ticker) → features
#
# date        ticker  open    high    low     close   volume
# 2024-01-02  AAPL    184.22  185.88  183.43  185.52  6.8e7
# 2024-01-02  MSFT    374.01  376.54  373.21  376.04  2.1e7
# 2024-01-02  NVDA    495.22  498.44  491.11  495.72  4.2e7
# Index: date + ticker as MultiIndex
df = df.set_index(['date', 'ticker'])
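As a rough illustration of why the long format is convenient, the sketch below runs three common operations on a toy frame: pulling one ticker's series, ranking tickers within each date, and pivoting to a wide view. The 2024-01-02 closes are the sample values shown above; the second date's values are invented for the example.

```python
import pandas as pd

rows = [
    ("2024-01-02", "AAPL", 185.52),
    ("2024-01-02", "MSFT", 376.04),
    ("2024-01-02", "NVDA", 495.72),
    # Illustrative values only, not real quotes
    ("2024-01-03", "AAPL", 184.00),
    ("2024-01-03", "MSFT", 375.00),
    ("2024-01-03", "NVDA", 480.00),
]
df = pd.DataFrame(rows, columns=["date", "ticker", "close"])
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(["date", "ticker"]).sort_index()

# 1) One ticker's time series: select on the 'ticker' level
aapl = df.xs("AAPL", level="ticker")

# 2) Cross-sectional operation: rank closes within each date (1 = highest)
df["close_rank"] = df.groupby(level="date")["close"].rank(ascending=False)

# 3) Wide view when needed: dates as rows, tickers as columns
wide = df["close"].unstack("ticker")
```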
Computing daily returns
Raw price levels are non-stationary and should not be fed directly into a model. We compute log returns per ticker:
import numpy as np
def compute_returns(df, horizons=(1, 5, 10, 20)):
    """Compute forward log returns PER TICKER, never mixing tickers."""
    result = []
    for ticker, grp in df.groupby('ticker'):
        grp = grp.sort_index()
        log_close = np.log(grp['close'])
        for h in horizons:
            # FORWARD return: shift backward so it aligns with the prediction point
            # At time t, ret5 = log(close[t+5] / close[t])
            # ret5 is only known AFTER t+5: use only as a target, never as a feature
            grp[f'ret{h}'] = log_close.diff(h).shift(-h)
        result.append(grp)
    return pd.concat(result)

# Critical: forward returns are TARGETS, not features
# Never use future returns as input features
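The difference between a backward-looking return (usable as a feature) and a forward return (target only) is just the direction of the shift. A small worked example on a made-up price series, assuming log returns and a 2-day horizon:

```python
import numpy as np
import pandas as pd

dates = pd.bdate_range("2024-01-01", periods=6)
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0, 105.0], index=dates)

h = 2
log_close = np.log(close)

# Feature: return over the PAST h days, fully known at time t
past_ret = log_close.diff(h)

# Target: return over the NEXT h days, only known at time t + h
fwd_ret = log_close.diff(h).shift(-h)

# The same number appears in both series, h rows apart:
# fwd_ret at row 0 equals past_ret at row h
```

The first `h` rows of `past_ret` and the last `h` rows of `fwd_ret` are NaN, which is why rows with missing targets are dropped at the end of the pipeline rather than filled.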
Step 3: Options Chain Data
Hourly options data is significantly richer and more complex than daily OHLCV. For a daily prediction model, we need to aggregate hourly options metrics into daily summaries.
Key options metrics to extract
| Metric | What it measures | Aggregation |
|---|---|---|
| iv_30d | 30-day implied volatility (market's forecast of future vol) | End-of-day value |
| iv_hv_ratio | IV / historical vol — premium or discount to realized vol | End-of-day value |
| put_call_ratio | Put volume / call volume — market sentiment proxy | Daily total |
| skew | 25-delta put IV minus 25-delta call IV — tail risk pricing | End-of-day value |
| term_structure_slope | Difference between front-month and back-month IV | End-of-day value |
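def aggregate_options_daily(options_df):
    """Aggregate hourly options data to daily, using only data available at EOD."""
    options_df = options_df.sort_values(['ticker', 'date', 'hour'])

    # EOD snapshot: with hourly data, the 15:00 bar is the last one of the
    # regular session and approximates the close
    eod = options_df[options_df['hour'] == 15]
    daily = eod.groupby(['date', 'ticker']).agg({
        'iv_30d': 'last',
        'skew_25d': 'last',
        'term_slope': 'last',
    })

    # Put/call ratio is a DAILY TOTAL, so aggregate over all hours, not just the EOD bar.
    # Assumption: the hourly data has 'put_volume' and 'call_volume' columns.
    # Summing hourly ratios would not give the daily ratio.
    volume = options_df.groupby(['date', 'ticker'])[['put_volume', 'call_volume']].sum()
    daily['put_call_ratio'] = volume['put_volume'] / volume['call_volume']

    # IV/HV ratio is added after merging with price data:
    # historical vol is the 20-day realized vol of close prices
    # (computed from price data, not options data)
    return daily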
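The function above leaves the IV/HV ratio to a later step. One way it could be computed once prices and `iv_30d` sit in the same frame is sketched below. The function name, the `hv` column, and the annualization by 252 trading days are assumptions for illustration, and `iv_30d` is assumed to be an annualized decimal (0.30 = 30%).

```python
import numpy as np
import pandas as pd


def add_iv_hv_ratio(df: pd.DataFrame, window: int = 20,
                    periods_per_year: int = 252) -> pd.DataFrame:
    """df: (date, ticker) MultiIndex with 'close' and 'iv_30d' columns."""
    out = df.sort_index().copy()

    # Daily log returns, computed within each ticker
    log_ret = np.log(out["close"]).groupby(level="ticker").diff()

    # Rolling standard deviation of daily returns, again per ticker
    rolling_std = log_ret.groupby(level="ticker").transform(
        lambda s: s.rolling(window).std()
    )

    # Annualize so the units match implied volatility
    out["hv"] = rolling_std * np.sqrt(periods_per_year)
    out["iv_hv_ratio"] = out["iv_30d"] / out["hv"]
    return out
```

The first `window` rows of each ticker have no ratio, because `window` daily returns are needed before the first realized-vol estimate exists. Only past closes enter the calculation, so the ratio is safe to use as a feature at time t.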
Handling the coverage gap
Options data only covers ~2 years while price data covers 20 years. There are two approaches:
- Exclude options features from the 20-year model — build price-only features for the long history, add options features as an additional layer for the more recent period.
- Fill missing options data with 0 or NaN — this is what we initially did, which introduced a problem: the model saw options features as "all zero" for 18 years and "meaningful values" for 2 years. The model never learned that zeros meant "missing" vs. "actual zero value."
A mask feature such as has_options or has_sentiment would have helped the model distinguish missing from zero. Always handle missingness explicitly.
Step 4: News Sentiment
News sentiment is the most complex data stream because it's event-driven — articles arrive at irregular intervals throughout the day, not at fixed timestamps.
From raw text to daily sentiment scores
We use pre-trained NLP models from HuggingFace to extract sentiment from financial news. The pipeline has three stages:
- Fetch raw news articles associated with each ticker from Finnhub / GDELT for a given day
- Pass each article's headline through a financial sentiment model (e.g., ProsusAI/finbert, mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis)
- Aggregate per-article scores into a single daily score per ticker
from typing import Optional

import pandas as pd
from transformers import pipeline

# Load model once, reuse
sentiment_model = pipeline(
    "text-classification",
    model="ProsusAI/finbert",
    device=0  # GPU
)

def compute_daily_sentiment(articles: list[dict], ticker: str, date: str) -> Optional[float]:
    """
    articles: list of {'headline': str, 'published_at': timezone-aware datetime}
    Returns: single float sentiment score for (ticker, date), or None if there are no articles
    """
    if not articles:
        return None  # Explicit None, not 0: missing is different from neutral
    # CRITICAL: only use articles published BEFORE market close on 'date'
    # Articles published after 4pm must be lagged to next trading day
    cutoff = pd.Timestamp(f"{date} 16:00:00", tz='US/Eastern')
    articles_before_close = [a for a in articles if a['published_at'] < cutoff]
    ...
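The aggregation stage is elided above. One possible version, shown here as a sketch rather than the method actually used in this project, converts each prediction into a signed score and averages them. It assumes the model emits the labels `positive`, `negative`, and `neutral` (check the model's `id2label` config) in the `{'label', 'score'}` shape returned by a text-classification pipeline. `aggregate_scores` and `LABEL_SIGN` are hypothetical names.

```python
from typing import Optional

# Assumed label set; verify against the model's config before relying on it
LABEL_SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def aggregate_scores(predictions: list[dict]) -> Optional[float]:
    """Average signed confidence over all headlines for one (ticker, date).

    predictions: one {'label': str, 'score': float} dict per headline.
    Returns a value in [-1, 1], or None when there are no headlines.
    """
    if not predictions:
        return None  # missing, not neutral
    signed = [LABEL_SIGN[p["label"].lower()] * p["score"] for p in predictions]
    return sum(signed) / len(signed)
```

A plain mean treats every headline equally. Weighting by source or recency is a reasonable variation, but it adds parameters that need their own validation.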
The lag rule — non-negotiable
If news about AAPL is published at 6pm on Monday, you cannot use it as a feature for predicting Monday's return (the market has already closed). You must lag it to Tuesday.
sentiment_feature[t] = f(articles published before market close at t)
This seems obvious but is easy to get wrong when working with timezone-naive timestamps. Always store timestamps with an explicit timezone and convert to exchange time before applying the cutoff.
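A minimal sketch of the lag rule as a lookup: given an article timestamp, find the first trading day whose close comes strictly after it. The function name is hypothetical, the close is assumed to be 16:00 New York time on every trading day (early-close sessions are ignored), and `trading_days` is assumed to be a sorted, timezone-naive daily index like the master calendar.

```python
from typing import Optional

import pandas as pd

MARKET_TZ = "America/New_York"


def assign_feature_date(published_at: pd.Timestamp,
                        trading_days: pd.DatetimeIndex,
                        close_hour: int = 16) -> Optional[pd.Timestamp]:
    """Return the trading day whose feature row may use this article."""
    if published_at.tzinfo is None:
        raise ValueError("published_at must be timezone-aware")

    # Compare in exchange time, whatever timezone the feed uses
    local = published_at.tz_convert(MARKET_TZ)

    # Market close for each trading day, as timezone-aware timestamps
    closes = (trading_days + pd.Timedelta(hours=close_hour)).tz_localize(MARKET_TZ)

    # side="right": an article stamped exactly at the close is lagged
    idx = closes.searchsorted(local, side="right")
    if idx >= len(trading_days):
        return None  # published after the last close in the calendar
    return trading_days[idx]
```

Weekend and holiday articles fall through to the next trading day automatically, because the lookup runs against the trading calendar rather than against calendar dates.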
Step 5: Merging Into the Master DataFrame
With all three sources prepared at daily granularity, we merge them on (date, ticker):
# All DataFrames indexed by (date, ticker) MultiIndex
price_df = ...      # OHLCV + computed returns (targets separate)
options_df = ...    # Aggregated daily options metrics
sentiment_df = ...  # Daily sentiment scores per ticker

# Left-join onto price data: keeps every (date, ticker) row that has prices
df = (
    price_df
    .join(options_df, how='left')
    .join(sentiment_df, how='left')
)
# Result: one row per (date, ticker), columns = all features + targets
# Options and sentiment may be NaN for older dates, so handle them explicitly

# Add missingness indicators BEFORE filling, so they reflect the raw data
df['has_options'] = df['iv_30d'].notna().astype(int)
df['has_sentiment'] = df['sentiment_score'].notna().astype(int)

# Fill NaN in feature columns (never in target columns)
# Forward fill options data within a ticker (last known IV);
# rows before a ticker's first options observation stay NaN
df = df.sort_index()
df['iv_30d'] = df.groupby('ticker')['iv_30d'].ffill()
# Sentiment NaN → 0; has_sentiment tells the model this means "no news", not "neutral news"
df['sentiment_score'] = df['sentiment_score'].fillna(0.0)
Step 6: Train / Validation Split Without Leakage
For time series, the train/validation split is not random — it's chronological. You train on the past and validate on the future. Anything else constitutes look-ahead bias.
def chronological_split(df, train_ratio=0.8):
    """Split on TIME, not randomly."""
    all_dates = df.index.get_level_values('date').unique().sort_values()
    cutoff_idx = int(len(all_dates) * train_ratio)
    cutoff_date = all_dates[cutoff_idx]
    train_df = df[df.index.get_level_values('date') <= cutoff_date]
    val_df = df[df.index.get_level_values('date') > cutoff_date]
    print(f"Train: {train_df.index.get_level_values('date').min()} "
          f"to {train_df.index.get_level_values('date').max()}")
    print(f"Val: {val_df.index.get_level_values('date').min()} "
          f"to {val_df.index.get_level_values('date').max()}")
    return train_df, val_df
The purge gap
For multi-horizon targets (e.g., ret20 = 20-day forward return), information from the training set can "bleed" into the validation set if their holding periods overlap. To prevent this, a purge gap longer than the maximum target horizon (21 trading days in this project) is inserted between train end and validation start.
PURGE_DAYS = 21  # In TRADING days; greater than max horizon (20 days)
cutoff_date = all_dates[cutoff_idx]
# Count the gap along the trading calendar, not in calendar days:
# 21 calendar days is only about 15 trading days, shorter than the 20-day horizon
purge_end = all_dates[cutoff_idx + PURGE_DAYS]
train_df = df[df.index.get_level_values('date') <= cutoff_date]
val_df = df[df.index.get_level_values('date') >= purge_end]
# Rows between cutoff_date and purge_end are discarded
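To check that a purge gap actually covers the longest target horizon, it helps to count it along the trading calendar. The sketch below does the split on dates only; `split_with_purge` is a hypothetical helper, and business days stand in for the real trading calendar in the usage example.

```python
import pandas as pd


def split_with_purge(all_dates: pd.DatetimeIndex,
                     train_ratio: float = 0.8,
                     purge_days: int = 21):
    """Return (train_dates, val_dates).

    Validation starts `purge_days` trading days after the last training date,
    so with purge_days > max horizon no training target window reaches into
    the validation period.
    """
    all_dates = all_dates.unique().sort_values()
    cutoff_idx = int(len(all_dates) * train_ratio)
    val_start_idx = cutoff_idx + purge_days
    if val_start_idx >= len(all_dates):
        raise ValueError("Not enough dates left for validation after purging")

    train_dates = all_dates[: cutoff_idx + 1]  # up to and including the cutoff
    val_dates = all_dates[val_start_idx:]
    return train_dates, val_dates


# Usage: business days as a stand-in for the trading calendar
dates = pd.bdate_range("2020-01-01", periods=500)
train_dates, val_dates = split_with_purge(dates)

# The last training row's 20-day target ends at position last_train + 20,
# which must fall before the first validation date
last_train = dates.get_loc(train_dates[-1])
first_val = dates.get_loc(val_dates[0])
```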
Step 7: Normalization
Raw features have wildly different scales: close prices in hundreds of dollars, volume in tens of millions, sentiment scores between -1 and +1. Feeding these directly into a neural network causes the model to effectively ignore small-magnitude features. Normalization fixes this.
Two normalization approaches
Critical: fit normalizer on training data only
The mean, standard deviation, min, and max used for normalization must be computed from the training set only. If you compute statistics on the full dataset (train + validation), you're using future information to normalize past data — a subtle form of look-ahead bias.
import joblib
from sklearn.preprocessing import StandardScaler

feature_cols = ['close', 'volume', 'volatility', 'sentiment_score',
                'iv_30d', 'iv_hv_ratio', 'rsi_14', ...]

# Work on copies so the assignments below do not write into slices of df
train_df = train_df.copy()
val_df = val_df.copy()

# Fit ONLY on training data
scaler = StandardScaler()
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])
# Transform validation using TRAINING statistics
val_df[feature_cols] = scaler.transform(val_df[feature_cols])
# Save scaler for inference time
joblib.dump(scaler, 'models/feature_scaler.pkl')
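The statistics named above (mean and standard deviation, min and max) correspond to z-score standardization and min-max scaling. A plain-pandas sketch of both, fitted on training data only, makes it visible what "training statistics" means. The helper names and the toy values are illustrative.

```python
import pandas as pd


def fit_stats(train: pd.DataFrame, cols):
    """Compute normalization statistics from the TRAINING set only."""
    return {
        "mean": train[cols].mean(),
        "std": train[cols].std(ddof=0),  # population std, as StandardScaler uses
        "min": train[cols].min(),
        "max": train[cols].max(),
    }


def zscore(df: pd.DataFrame, cols, stats) -> pd.DataFrame:
    out = df.copy()
    out[cols] = (df[cols] - stats["mean"]) / stats["std"]
    return out


def minmax(df: pd.DataFrame, cols, stats) -> pd.DataFrame:
    out = df.copy()
    out[cols] = (df[cols] - stats["min"]) / (stats["max"] - stats["min"])
    return out


# Toy data: validation values lie outside the training range
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
val = pd.DataFrame({"x": [5.0, 6.0]})
cols = ["x"]

stats = fit_stats(train, cols)
train_z, val_z = zscore(train, cols, stats), zscore(val, cols, stats)
val_mm = minmax(val, cols, stats)
```

With training statistics, validation values can land outside the range seen in training: `val_mm` exceeds 1 here. That is expected and honest. Refitting on the full dataset would hide the shift and leak future information.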
The Sector Grouping Problem
The NASDAQ-100 contains companies from dramatically different industries: semiconductors (NVDA, AMD, QCOM), consumer staples (COST, PEP), software (MSFT, ORCL), and biotech (AMGN, GILD). These behave differently under the same macro conditions.
Training one model on all 103 tickers forces it to learn an average behavior that fits no sector well. This is part of why the Standard Transformer underperformed — it was trying to be NVDA and COST simultaneously.
SECTOR_MAP = {
    # Semiconductors
    'NVDA': 'semiconductors', 'AMD': 'semiconductors', 'QCOM': 'semiconductors',
    'MRVL': 'semiconductors', 'INTC': 'semiconductors', 'AVGO': 'semiconductors',
    # Consumer
    'COST': 'consumer', 'PEP': 'consumer', 'SBUX': 'consumer',
    'MDLZ': 'consumer', 'KHC': 'consumer',
    # Software / Cloud
    'MSFT': 'software', 'ORCL': 'software', 'CRM': 'software',
    'ADBE': 'software', 'INTU': 'software', 'NOW': 'software',
    # Large-cap tech
    'AAPL': 'mega_cap', 'GOOGL': 'mega_cap', 'META': 'mega_cap',
    'AMZN': 'mega_cap', 'TSLA': 'mega_cap',
    # Biotech
    'AMGN': 'biotech', 'GILD': 'biotech', 'REGN': 'biotech',
}
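A mapping like this can be used to carve the long-format DataFrame into per-sector frames. The sketch below assumes a (date, ticker) MultiIndex. `split_by_sector` is a hypothetical helper, the map is a small subset, and tickers missing from the map go to an explicit `other` bucket rather than being dropped silently.

```python
import pandas as pd

# Small subset of the sector map, for illustration
SECTOR_SUBSET = {
    "NVDA": "semiconductors", "AMD": "semiconductors",
    "COST": "consumer",
}


def split_by_sector(df: pd.DataFrame, sector_map: dict, unknown: str = "other") -> dict:
    """Return {sector_name: sub-DataFrame} for a (date, ticker)-indexed frame."""
    tickers = df.index.get_level_values("ticker")
    # One sector label per row, in row order
    sectors = [sector_map.get(t, unknown) for t in tickers]
    return {name: grp for name, grp in df.groupby(sectors)}


# Toy frame: 2 dates x 4 tickers, one of them unmapped
dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
index = pd.MultiIndex.from_product(
    [dates, ["NVDA", "AMD", "COST", "ZZZZ"]], names=["date", "ticker"]
)
toy = pd.DataFrame({"close": range(len(index))}, index=index)

parts = split_by_sector(toy, SECTOR_SUBSET)
```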
In Part 5, we'll train separate TFT models per sector and show why consumer names were the most learnable and semiconductors were the hardest.
Data Quality Checks
Before any modeling, run these checks on the assembled DataFrame:
- No NaN in target columns — rows with missing targets must be dropped, not filled. You can't fill in the "right answer."
- Ticker coverage consistency — every ticker should have data for every date in its active period. Gaps indicate delisting or data source issues.
- No future information in features — all features at time t must be computable using only data available at market close on day t.
- Ticker IDs are stable across runs — if you encode tickers as integers for embedding layers, the same ticker must get the same ID in every training run. Use an explicit, saved mapping.
- Return outliers — returns outside ±30% in a single period are suspicious and should be flagged (stock splits, data errors, or genuine market events the model can't generalize from).
def data_quality_report(df, target_cols=('ret5', 'ret10', 'ret20')):
    print("=== Data Quality Report ===")
    print(f"Shape: {df.shape}")
    print(f"Date range: {df.index.get_level_values('date').min()} "
          f"to {df.index.get_level_values('date').max()}")
    print(f"Tickers: {df.index.get_level_values('ticker').nunique()}")
    # NaN check in targets
    for col in target_cols:
        nan_pct = df[col].isna().mean()
        print(f"  {col} NaN: {nan_pct:.1%}")
    # Return outlier check
    for col in target_cols:
        outlier_pct = (df[col].abs() > 0.30).mean()
        print(f"  {col} outliers (>30%): {outlier_pct:.1%}")
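The report above covers the target NaN and outlier checks. Two of the other checks, stable ticker IDs and coverage gaps, could be sketched as follows. Both helper names are hypothetical, and the JSON file is just one simple way to persist the mapping.

```python
import json
from pathlib import Path

import pandas as pd


def load_or_create_ticker_ids(tickers, path) -> dict:
    """Return a ticker -> int mapping that never reassigns an existing ID."""
    path = Path(path)
    mapping = json.loads(path.read_text()) if path.exists() else {}
    next_id = max(mapping.values(), default=-1) + 1
    # Only unseen tickers get new IDs; sorted() makes the first run deterministic
    for ticker in sorted(set(tickers) - set(mapping)):
        mapping[ticker] = next_id
        next_id += 1
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True))
    return mapping


def find_coverage_gaps(df: pd.DataFrame, master_index: pd.DatetimeIndex) -> dict:
    """For each ticker, list master-calendar dates missing between its
    first and last observation. Tickers without gaps are omitted."""
    gaps = {}
    for ticker, grp in df.groupby(level="ticker"):
        dates = grp.index.get_level_values("date")
        active = master_index[(master_index >= dates.min())
                              & (master_index <= dates.max())]
        missing = active.difference(dates)
        if len(missing) > 0:
            gaps[ticker] = missing
    return gaps
```

Because new tickers are appended after the current maximum ID, adding a constituent later does not shift the IDs an already-trained embedding layer depends on.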
Summary: The Pipeline in 7 Steps
- Establish the NYSE trading calendar as the master time index
- Load and structure OHLCV data in long format, compute log returns per ticker
- Aggregate hourly options chain data to daily summaries, retain EOD snapshot
- Pass news headlines through FinBERT, aggregate to daily sentiment scores, enforce the pre-market-close lag rule
- Merge on (date, ticker), add missingness indicators, forward-fill where appropriate
- Chronological 80/20 train/validation split with a 21-trading-day purge gap
- Fit normalizer on training data only, apply to both splits
In Part 3, we build the feature engineering layer on top of this pipeline.