Data Pipeline: NASDAQ-100, Options Chain, and Sentiment

The Three Data Sources

This project uses three distinct data streams. Each has a different frequency, coverage, and latency profile.

Source	Frequency	Coverage	Latency	Provider
NASDAQ-100 price/volume	Daily OHLCV	20 years	End-of-day	Interactive Brokers
Options chain	Hourly	~2 years	Real-time / 15-minute delay	Interactive Brokers
News sentiment	Event-driven → daily aggregated	~1-2 years	Near real-time	Finnhub, GDELT, IB

The fundamental alignment problem: these sources have different granularities, different start dates, and different update patterns. Getting them into a single daily DataFrame per ticker is the foundation for everything that follows.

Step 1: Establish the Master Calendar

The first architectural decision is: what is the master time index? Every other data source gets aligned to this.

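For NASDAQ-100 trading, the master index is the set of all US equity trading days, excluding weekends and market holidays. We generate it from an exchange calendar rather than inferring it from the price data, so a gap in the price feed shows up as a missing row instead of silently shrinking the calendar. The code below uses the NYSE calendar, which follows the same holiday schedule as NASDAQ.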

import pandas as pd
import pandas_market_calendars as mcal

# Generate master trading calendar
nyse = mcal.get_calendar('NYSE')
schedule = nyse.schedule(start_date='2004-01-01', end_date='2024-12-31')
trading_days = mcal.date_range(schedule, frequency='1D')

# Convert to date index (no time component for daily data)
master_index = pd.DatetimeIndex([d.date() for d in trading_days])
# Result: roughly 21 x 252 ≈ 5,290 trading days
# (2004-01-01 through 2024-12-31 spans 21 calendar years, not 20)

Why this matters

Every feature, every sentiment score, every options metric must be indexed to this calendar. Trading days with no observation become NaN and are then forward-filled (with the most recent prior value) or imputed. Weekend and holiday rows in daily series are dropped. Event data that arrives outside trading hours, such as news, is not thrown away: it is rolled forward to the next trading day (see the lag rule in Step 4).
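As a minimal sketch of that alignment, assume a raw daily series that contains a Saturday row and is missing some trading days. Here `pd.bdate_range` is only a stand-in for the exchange calendar (it knows nothing about holidays), and the `limit=5` cap on forward-filling is an illustrative choice, not a value taken from this project.

```python
import pandas as pd

# Stand-in for the master calendar: weekdays only, no holiday handling
master_index = pd.bdate_range("2024-03-04", "2024-03-13")

# Hypothetical raw series: includes a Saturday row, misses several trading days
raw = pd.Series(
    [0.20, 0.21, 0.22, 0.99, 0.25],
    index=pd.to_datetime(
        ["2024-03-04", "2024-03-05", "2024-03-06", "2024-03-09", "2024-03-11"]
    ),
    name="iv_30d",
)

# Reindexing drops the Saturday row and inserts NaN for missing trading days
aligned = raw.reindex(master_index)

# Record missingness BEFORE filling, then forward-fill with a cap so a stale
# value is not carried forward indefinitely
was_missing = aligned.isna()
filled = aligned.ffill(limit=5)
```

Keeping `was_missing` alongside the filled values preserves the distinction between an observed value and a carried-forward one.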

Step 2: NASDAQ-100 Price Data

Daily OHLCV (Open, High, Low, Close, Volume) for the NASDAQ-100 universe is the most straightforward data stream. For this project, the universe is the current NASDAQ-100 constituents — 103 tickers including their options-active names.

Structuring the DataFrame

The cleanest structure for a multi-ticker time series system is a long-format DataFrame: one row per (date, ticker) pair. This makes cross-ticker operations, filtering, and merging straightforward.

# Long format: (date, ticker) → features
#
# date        ticker  open    high    low     close   volume
# 2024-01-02  AAPL    184.22  185.88  183.43  185.52  6.8e7
# 2024-01-02  MSFT    374.01  376.54  373.21  376.04  2.1e7
# 2024-01-02  NVDA    495.22  498.44  491.11  495.72  4.2e7

# Index: date + ticker as MultiIndex
df = df.set_index(['date', 'ticker'])

Computing daily returns

Raw price levels are non-stationary and should not be fed directly into a model. We compute log returns per ticker:

import numpy as np

def compute_returns(df, horizons=(1, 5, 10, 20)):
    """Compute forward log returns PER TICKER, never mixing tickers."""
    result = []
    for ticker, grp in df.groupby('ticker'):
        grp = grp.sort_index()
        log_close = np.log(grp['close'])
        for h in horizons:
            # FORWARD return: shift backward so we align with the prediction point
            # At time t, ret5 = log(close[t+5] / close[t])
            # This means ret5 is only known AFTER t+5: use only as target, never as feature
            grp[f'ret{h}'] = log_close.diff(h).shift(-h)
        result.append(grp)
    return pd.concat(result)

# Critical: forward returns are TARGETS, not features
# Never use future returns as input features

Leakage risk

Forward returns (ret5, ret10, ret20) are the targets — what you're trying to predict. They must never appear as input features. This sounds obvious, but it's one of the most common sources of look-ahead bias in ML pipelines. Always keep targets separate from features in your column schema.
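One way to enforce that separation is a small guard that identifies target columns by naming convention and fails loudly if any of them is listed as a feature. This is a sketch that assumes targets follow the `ret{h}` pattern used above; `split_schema` is a hypothetical helper, not part of the project code.

```python
import re

TARGET_PATTERN = re.compile(r"^ret\d+$")  # assumed naming: ret1, ret5, ret20, ...

def split_schema(columns, feature_cols):
    """Return (features, targets); raise if a target is listed as a feature."""
    targets = [c for c in columns if TARGET_PATTERN.match(c)]
    leaked = sorted(set(feature_cols) & set(targets))
    if leaked:
        raise ValueError(f"Target columns used as features: {leaked}")
    unknown = sorted(set(feature_cols) - set(columns))
    if unknown:
        raise ValueError(f"Feature columns not in DataFrame: {unknown}")
    return list(feature_cols), targets

columns = ["open", "high", "low", "close", "volume", "ret1", "ret5", "ret20"]
features, targets = split_schema(columns, ["close", "volume"])
```

Calling the guard once, right before the model input is built, turns a silent leak into an immediate error.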

Step 3: Options Chain Data

Hourly options data is significantly richer and more complex than daily OHLCV. For a daily prediction model, we need to aggregate hourly options metrics into daily summaries.

Key options metrics to extract

Metric	What it measures	Aggregation
iv_30d	30-day implied volatility (market's forecast of future vol)	End-of-day value
iv_hv_ratio	IV / historical vol: premium or discount to realized vol	End-of-day value
put_call_ratio	Put volume / call volume (market sentiment proxy)	Daily total
skew_25d	25-delta put IV minus 25-delta call IV (tail risk pricing)	End-of-day value
term_slope	Difference between front-month and back-month IV (term structure slope)	End-of-day value

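import numpy as np

def aggregate_options_daily(options_df):
    """Aggregate hourly options data to daily, using only data available at EOD."""
    options_df = options_df.sort_values(['date', 'ticker', 'hour'])

    # With hourly bars, the 15:00 bar is the last one before the 16:00 close,
    # so it serves as the EOD snapshot for point-in-time metrics
    eod = options_df[options_df['hour'] == 15]
    daily = eod.groupby(['date', 'ticker']).agg({
        'iv_30d': 'last',
        'skew_25d': 'last',
        'term_slope': 'last',
    })

    # Put/call ratio is a daily total: sum volumes over ALL hourly bars, then divide.
    # Summing hourly ratios would not give the daily ratio.
    # ASSUMPTION: options_df carries hourly 'put_volume' and 'call_volume' columns
    volume = options_df.groupby(['date', 'ticker'])[['put_volume', 'call_volume']].sum()
    daily['put_call_ratio'] = volume['put_volume'] / volume['call_volume'].replace(0, np.nan)

    # IV/HV ratio is added later, once price data is joined in.
    # Historical vol: 20-day realized vol using close prices
    # (computed from price data, not options data)
    return daily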
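The IV/HV step left as a comment above could be filled in along these lines. This is a sketch under stated assumptions: historical volatility is the 20-day standard deviation of daily log returns, annualized with the square root of 252, and `iv_30d` is quoted as an annualized decimal (0.25 means 25%). `add_iv_hv_ratio` and the `hv_20d` column are hypothetical names, and the toy prices are random.

```python
import numpy as np
import pandas as pd

def add_iv_hv_ratio(df, window=20, periods_per_year=252):
    """df: MultiIndex (date, ticker) with 'close' and 'iv_30d' columns."""
    df = df.sort_index()
    # Backward-looking daily log returns per ticker (safe to use in features)
    log_ret = np.log(df["close"]).groupby(level="ticker").diff()
    # Rolling std of those returns per ticker, annualized
    hv = log_ret.groupby(level="ticker").transform(
        lambda s: s.rolling(window).std()
    ) * np.sqrt(periods_per_year)
    df["hv_20d"] = hv
    # Zero realized vol would give an infinite ratio; treat it as missing
    df["iv_hv_ratio"] = df["iv_30d"] / hv.replace(0, np.nan)
    return df

# Toy data: two tickers, 30 business days of random-walk prices
dates = pd.bdate_range("2024-01-01", periods=30)
rng = np.random.default_rng(0)
frames = []
for ticker, start in [("AAA", 100.0), ("BBB", 50.0)]:
    close = start * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates))))
    frames.append(pd.DataFrame(
        {"date": dates, "ticker": ticker, "close": close, "iv_30d": 0.25}
    ))
toy = pd.concat(frames).set_index(["date", "ticker"])
out = add_iv_hv_ratio(toy)
```

The first 20 rows of each ticker have no ratio, because the rolling window needs 20 prior returns; those rows should be treated as missing rather than filled with zero.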

Handling the coverage gap

Options data only covers ~2 years while price data covers 20 years. There are two approaches:

Exclude options features from the 20-year model — build price-only features for the long history, add options features as an additional layer for the more recent period.
Fill missing options data with 0 or NaN — this is what we initially did, which introduced a problem: the model saw options features as "all zero" for 18 years and "meaningful values" for 2 years. The model never learned that zeros meant "missing" vs. "actual zero value."

Lesson learned

When we ran the Standard Transformer with 3 years of data and only 1 year of sentiment, the model saw zeros for the missing sentiment period. It failed to learn from sentiment at all — zeros looked like valid data. A binary has_sentiment mask feature would have helped the model distinguish missing from zero. Always handle missingness explicitly.

Step 4: News Sentiment

News sentiment is the most complex data stream because it's event-driven — articles arrive at irregular intervals throughout the day, not at fixed timestamps.

From raw text to daily sentiment scores

We use pre-trained NLP models from HuggingFace to extract sentiment from financial news. The pipeline has three stages:

Fetch raw news articles associated with each ticker from Finnhub / GDELT for a given day
Pass each article's headline through a financial sentiment model (e.g., ProsusAI/finbert, mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis)
Aggregate per-article scores into a single daily score per ticker

from transformers import pipeline

# Load model once, reuse
sentiment_model = pipeline(
    "text-classification",
    model="ProsusAI/finbert",
    device=0  # GPU
)

from typing import Optional

def compute_daily_sentiment(articles: list[dict], ticker: str, date: str) -> Optional[float]:
    """
    articles: list of {'headline': str, 'published_at': tz-aware datetime}
    Returns: single float sentiment score for (ticker, date), or None if no usable articles
    """
    # CRITICAL: only use articles published BEFORE market close on 'date'
    # Articles published after 4pm must be lagged to next trading day
    # (early-close sessions would need an earlier cutoff than 16:00)
    cutoff = pd.Timestamp(f"{date} 16:00:00", tz='US/Eastern')
    articles_before_close = [a for a in articles if a['published_at'] < cutoff]

    # Check AFTER filtering: a day whose only articles arrived post-close is also missing
    if not articles_before_close:
        return None  # Explicit None, not 0: missing is different from neutral
    ...
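The aggregation step can take several forms. One possibility, offered as a sketch rather than this project's method, is to map each classifier output to a signed score (plus the confidence for positive, minus the confidence for negative, zero for neutral) and average across the day's articles. The label names and the `{'label', 'score'}` output shape are assumptions about the classifier; `signed_score` and `aggregate_scores` are hypothetical helpers, and the model call is replaced by hard-coded outputs so the example runs without a GPU.

```python
from statistics import mean
from typing import Optional

SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}  # assumed label set

def signed_score(pred: dict) -> float:
    """Map {'label': str, 'score': float} to a value in [-1, 1]."""
    return SIGN[pred["label"].lower()] * float(pred["score"])

def aggregate_scores(preds: list) -> Optional[float]:
    """Mean signed score for one (ticker, date); None when there are no articles."""
    if not preds:
        return None  # missing, not neutral
    return mean(signed_score(p) for p in preds)

# Hard-coded stand-ins for classifier outputs on three headlines
preds = [
    {"label": "positive", "score": 0.90},
    {"label": "negative", "score": 0.60},
    {"label": "neutral", "score": 0.80},
]
daily_score = aggregate_scores(preds)  # (0.90 - 0.60 + 0.0) / 3
```

A plain mean treats every article equally; weighting by source or recency is a separate design decision that this sketch leaves open.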

The lag rule — non-negotiable

If news about AAPL is published at 6pm on Monday, it cannot go into Monday's feature row: that row may contain only what was known at Monday's 4pm close. The article must be lagged to Tuesday.

sentiment_feature[t] = f(articles published after market close at t-1 and before market close at t)

This seems obvious but is easy to get wrong when working with timezone-naive timestamps. Always store timestamps with an explicit timezone.
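A sketch of how the rule can be applied in bulk: each article is assigned to the first trading day whose close falls strictly after its publication time, so evening and weekend articles roll forward automatically. `assign_feature_date` is a hypothetical helper, `pd.bdate_range` stands in for the real exchange calendar, and a fixed 16:00 close is assumed (early-close sessions are ignored).

```python
import pandas as pd

NY = "America/New_York"

def assign_feature_date(published_at, trading_days, close_hour=16):
    """Map tz-aware publication times to the first trading day whose close
    comes strictly after them. Returns NaT past the end of the calendar."""
    ts = pd.DatetimeIndex(published_at).tz_convert(NY)  # raises on tz-naive input
    closes = (trading_days + pd.Timedelta(hours=close_hour)).tz_localize(NY)
    # side="right": an article stamped exactly at the close rolls to the next day
    pos = closes.searchsorted(ts, side="right")
    padded = trading_days.append(pd.DatetimeIndex([pd.NaT]))
    return padded[pos]

trading_days = pd.bdate_range("2024-03-04", "2024-03-11")  # stand-in calendar
published = pd.to_datetime([
    "2024-03-04 10:00",  # Monday, during the session
    "2024-03-04 18:00",  # Monday evening
    "2024-03-08 17:00",  # Friday after the close
    "2024-03-09 12:00",  # Saturday
]).tz_localize(NY)
feature_dates = assign_feature_date(published, trading_days)
```

In this toy calendar the four articles land on Monday, Tuesday, the following Monday, and the following Monday respectively. Rejecting timezone-naive input is deliberate: a timestamp without a timezone cannot be compared safely against the close.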

Step 5: Merging Into the Master DataFrame

With all three sources prepared at daily granularity, we merge them on (date, ticker):

# All DataFrames indexed by (date, ticker) MultiIndex
price_df     = ...  # OHLCV + computed returns (targets separate)
options_df   = ...  # Aggregated daily options metrics
sentiment_df = ...  # Daily sentiment scores per ticker

# Left join onto price data: keeps every (date, ticker) row that has prices,
# whether or not options or sentiment data exist for it
df = (
    price_df
    .join(options_df, how='left')
    .join(sentiment_df, how='left')
)

# Result: one row per (date, ticker), columns = all features + targets
# Options and sentiment may be NaN for older dates — handle explicitly

# Add missingness indicators
df['has_options']   = (~df['iv_30d'].isna()).astype(int)
df['has_sentiment'] = (~df['sentiment_score'].isna()).astype(int)

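# Fill NaN in feature columns (never in target columns)
# Forward fill options data within a ticker (last known IV).
# Rows before a ticker's first observation stay NaN and still need explicit handling.
df['iv_30d'] = df.groupby('ticker')['iv_30d'].ffill()
# Sentiment NaN → 0 as a placeholder; has_sentiment (computed above, before the fill)
# is what distinguishes "no news" from a genuinely neutral score
df['sentiment_score'] = df['sentiment_score'].fillna(0.0)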

Step 6: Train / Validation Split Without Leakage

For time series, the train/validation split is not random — it's chronological. You train on the past and validate on the future. Anything else constitutes look-ahead bias.

def chronological_split(df, train_ratio=0.8):
    """Split on TIME, not randomly."""
    all_dates = df.index.get_level_values('date').unique().sort_values()
    cutoff_idx = int(len(all_dates) * train_ratio)
    cutoff_date = all_dates[cutoff_idx]

    train_df = df[df.index.get_level_values('date') <= cutoff_date]
    val_df   = df[df.index.get_level_values('date') > cutoff_date]

    print(f"Train: {train_df.index.get_level_values('date').min()} "
          f"to {train_df.index.get_level_values('date').max()}")
    print(f"Val:   {val_df.index.get_level_values('date').min()} "
          f"to {val_df.index.get_level_values('date').max()}")

    return train_df, val_df

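The purge gap

For multi-horizon targets (e.g., ret20 = 20-day forward return), information from the training set can "bleed" into the validation set if their holding periods overlap: a training row dated at the cutoff has a target that depends on prices up to 20 trading days later. A purge gap longer than the maximum target horizon (21 trading days in this project, against a 20-day horizon) is inserted between train end and validation start to prevent this.

PURGE_DAYS = 21  # trading days; greater than max horizon (20 trading days)

all_dates   = df.index.get_level_values('date').unique().sort_values()
cutoff_idx  = int(len(all_dates) * 0.8)
cutoff_date = all_dates[cutoff_idx]
# Count the gap in trading days by stepping through the date index.
# cutoff_date + pd.Timedelta(days=21) would be 21 CALENDAR days (~15 trading days),
# which is shorter than the ret20 horizon
purge_end   = all_dates[min(cutoff_idx + PURGE_DAYS, len(all_dates) - 1)]

dates    = df.index.get_level_values('date')
train_df = df[dates <= cutoff_date]
val_df   = df[dates >= purge_end]
# Rows strictly between cutoff_date and purge_end are discarded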
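The split and the purge can also be wrapped in one function. The sketch below counts the gap in rows of the date index, which means trading days rather than calendar days, and raises when the data is too short to leave a validation set. `purged_chronological_split` is a hypothetical name, and the toy frame uses weekdays as a stand-in calendar.

```python
import pandas as pd

def purged_chronological_split(df, train_ratio=0.8, purge_days=21):
    """Chronological split with a purge gap counted in trading-day rows."""
    dates = df.index.get_level_values("date")
    all_dates = dates.unique().sort_values()
    cutoff_idx = int(len(all_dates) * train_ratio)
    val_start_idx = cutoff_idx + purge_days
    if val_start_idx >= len(all_dates):
        raise ValueError("Not enough dates for the requested split and purge gap")
    cutoff_date = all_dates[cutoff_idx]
    val_start = all_dates[val_start_idx]
    # Rows strictly between cutoff_date and val_start are dropped
    return df[dates <= cutoff_date], df[dates >= val_start]

# Toy frame: 200 weekdays x 2 tickers
dates = pd.bdate_range("2023-01-02", periods=200)
idx = pd.MultiIndex.from_product([dates, ["AAA", "BBB"]], names=["date", "ticker"])
toy = pd.DataFrame({"close": 1.0}, index=idx)

train_df, val_df = purged_chronological_split(toy)
train_end = train_df.index.get_level_values("date").max()
val_start = val_df.index.get_level_values("date").min()
gap_in_trading_days = dates.get_loc(val_start) - dates.get_loc(train_end)
```

With a 20-day horizon, the last training target ends 20 trading days after `train_end`, one day before the first validation row.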

Step 7: Normalization

Raw features have wildly different scales: close prices in hundreds of dollars, volume in tens of millions, sentiment scores between -1 and +1. Feeding these directly into a neural network causes the model to effectively ignore small-magnitude features. Normalization fixes this.

Two normalization approaches

Z-Score Standardization

Subtract the mean, divide by standard deviation. Each feature now has mean ≈ 0 and std ≈ 1. Best for features you expect to be roughly Gaussian. Standard choice for transformer inputs.

Min-Max Scaling

Scale to [0, 1] range. Good for features with known bounds (like RSI: always 0–100). Sensitive to outliers.

Critical: fit normalizer on training data only

The mean, standard deviation, min, and max used for normalization must be computed from the training set only. If you compute statistics on the full dataset (train + validation), you're using future information to normalize past data — a subtle form of look-ahead bias.

from sklearn.preprocessing import StandardScaler

feature_cols = ['close', 'volume', 'volatility', 'sentiment_score',
                'iv_30d', 'iv_hv_ratio', 'rsi_14', ...]

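# Fit ONLY on training data
# (copy first: train_df and val_df are slices of df, and assigning into a slice
# triggers pandas' SettingWithCopyWarning)
train_df, val_df = train_df.copy(), val_df.copy()
scaler = StandardScaler()
train_df[feature_cols] = scaler.fit_transform(train_df[feature_cols])

# Transform validation using TRAINING statistics
val_df[feature_cols] = scaler.transform(val_df[feature_cols])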

# Save scaler for inference time
import joblib
joblib.dump(scaler, 'models/feature_scaler.pkl')
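To make the arithmetic explicit, here is a hand-rolled version of the same idea on a toy array: the mean and standard deviation come from the training rows only and are then reused unchanged on validation. `fit_zscore` and `apply_zscore` are hypothetical helpers; the population standard deviation (NumPy's default, `ddof=0`) is used, which is also what `StandardScaler` computes.

```python
import numpy as np

def fit_zscore(train):
    """Per-column mean and std computed from TRAINING rows only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)  # population std (ddof=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns
    return mu, sigma

def apply_zscore(x, mu, sigma):
    """Standardize x with statistics fitted elsewhere."""
    return (x - mu) / sigma

# Toy data: two features on very different scales, rows in time order
train = np.array([[100.0, 1e6], [110.0, 2e6], [120.0, 3e6]])
val = np.array([[130.0, 4e6]])

mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
val_z = apply_zscore(val, mu, sigma)
```

Validation values may fall well outside the range seen in training, as they do here. That is expected and is exactly the information a full-dataset fit would have leaked into the training statistics.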

The Sector Grouping Problem

The NASDAQ-100 contains companies from dramatically different industries: semiconductors (NVDA, AMD, QCOM), consumer staples (COST, PEP), software (MSFT, ADBE), and biotech (AMGN, GILD). These behave differently under the same macro conditions.

Training one model on all 103 tickers forces it to learn an average behavior that fits no sector well. This is part of why the Standard Transformer underperformed — it was trying to be NVDA and COST simultaneously.

SECTOR_MAP = {
    # Semiconductors
    'NVDA': 'semiconductors', 'AMD': 'semiconductors', 'QCOM': 'semiconductors',
    'MRVL': 'semiconductors', 'INTC': 'semiconductors', 'AVGO': 'semiconductors',
    # Consumer
    'COST': 'consumer', 'PEP': 'consumer', 'SBUX': 'consumer',
    'MDLZ': 'consumer', 'KHC': 'consumer',
    # Software / Cloud
    'MSFT': 'software', 'ORCL': 'software', 'CRM': 'software',
    'ADBE': 'software', 'INTU': 'software', 'NOW': 'software',
    # Large-cap tech
    'AAPL': 'mega_cap', 'GOOGL': 'mega_cap', 'META': 'mega_cap',
    'AMZN': 'mega_cap', 'TSLA': 'mega_cap',
    # Biotech
    'AMGN': 'biotech', 'GILD': 'biotech', 'REGN': 'biotech',
}

In Part 5, we'll train separate TFT (Temporal Fusion Transformer) models per sector and show why consumer names were the most learnable and semiconductors were the hardest.
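A sketch of how such a map can be used to carve the master DataFrame into per-sector training sets. Because the map above lists only part of the universe, unmapped tickers are routed to a fallback bucket instead of being dropped silently. `split_by_sector`, the `"other"` bucket, and the `ZZZZ` ticker are illustrative assumptions.

```python
import pandas as pd

def split_by_sector(df, sector_map, default="other"):
    """Return {sector: sub-DataFrame}; unmapped tickers go to `default`."""
    tickers = df.index.get_level_values("ticker")
    sectors = pd.Series([sector_map.get(t, default) for t in tickers])
    return {
        name: df[(sectors == name).to_numpy()]
        for name in sectors.unique()
    }

demo_map = {"NVDA": "semiconductors", "AMD": "semiconductors", "COST": "consumer"}
dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
idx = pd.MultiIndex.from_product(
    [dates, ["NVDA", "AMD", "COST", "ZZZZ"]], names=["date", "ticker"]
)
toy = pd.DataFrame({"close": range(8)}, index=idx)

by_sector = split_by_sector(toy, demo_map)
```

Checking the size of the fallback bucket before training is a cheap way to notice tickers that were never assigned a sector.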

Data Quality Checks

Before any modeling, run these checks on the assembled DataFrame:

No NaN in target columns — rows with missing targets must be dropped, not filled. You can't fill in the "right answer."
Ticker coverage consistency — every ticker should have data for every date in its active period. Gaps indicate delisting or data source issues.
No future information in features — all features at time t must be computable using only data available at market close on day t.
Ticker IDs are stable across runs — if you encode tickers as integers for embedding layers, the same ticker must get the same ID in every training run. Use an explicit, saved mapping.
Return outliers — returns outside ±30% in a single period are suspicious and should be flagged (stock splits, data errors, or genuine market events the model can't generalize from).

def data_quality_report(df, target_cols=['ret5', 'ret10', 'ret20']):
    print("=== Data Quality Report ===")
    print(f"Shape: {df.shape}")
    print(f"Date range: {df.index.get_level_values('date').min()} "
          f"to {df.index.get_level_values('date').max()}")
    print(f"Tickers: {df.index.get_level_values('ticker').nunique()}")

    # NaN check in targets
    for col in target_cols:
        nan_pct = df[col].isna().mean()
        print(f"  {col} NaN: {nan_pct:.1%}")

    # Return outlier check
    for col in target_cols:
        outlier_pct = (df[col].abs() > 0.30).mean()
        print(f"  {col} outliers (>30%): {outlier_pct:.1%}")
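The report above covers targets and outliers; the stable ticker ID check needs a mapping that persists across runs. A minimal sketch, assuming a JSON file is an acceptable store: existing IDs are never reassigned, and new tickers are appended with the next free integer. `load_or_create_ticker_ids` and the file name are hypothetical.

```python
import json
import os
import tempfile

def load_or_create_ticker_ids(tickers, path):
    """Reuse the saved ticker-to-ID mapping if present; append new tickers at the end."""
    mapping = {}
    if os.path.exists(path):
        with open(path) as f:
            mapping = json.load(f)
    next_id = max(mapping.values(), default=-1) + 1
    # Sort new tickers so the result does not depend on input order
    for t in sorted(set(tickers) - set(mapping)):
        mapping[t] = next_id
        next_id += 1
    with open(path, "w") as f:
        json.dump(mapping, f, indent=2, sort_keys=True)
    return mapping

path = os.path.join(tempfile.mkdtemp(), "ticker_ids.json")
first = load_or_create_ticker_ids(["MSFT", "AAPL", "NVDA"], path)
second = load_or_create_ticker_ids(["NVDA", "AMD", "AAPL", "MSFT"], path)
```

Re-deriving IDs from a sorted ticker list on every run would shift them whenever the universe changes, which silently invalidates a trained embedding layer; appending avoids that.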

Summary: The Pipeline in 7 Steps

Establish the NYSE trading calendar as the master time index
Load and structure OHLCV data in long format, compute log returns per ticker
Aggregate hourly options chain data to daily summaries, retain EOD snapshot
Pass news headlines through FinBERT, aggregate to daily sentiment scores, enforce the pre-market-close lag rule
Merge on (date, ticker), add missingness indicators, forward-fill where appropriate
Chronological 80/20 train/validation split with a 21-trading-day purge gap
Fit normalizer on training data only, apply to both splits

In Part 3, we build the feature engineering layer on top of this pipeline.