ElasticNet Dev-Test-Prod Promotion Design

Purpose

This document maps the current Azure ElasticNet pilot onto a simple model-promotion pipeline that is credible for enterprise-style delivery without overbuilding the first version.

The goal is not to build a full MLOps platform today. The goal is to establish:

a clear candidate model lifecycle
a low-cost dev/test/prod promotion pattern
durable artifacts
explicit approval and promotion steps
a structure that can absorb stronger security, observability, and automation later

Security posture for this document

This design lists the security controls that a financial application should have, but many of them are intentionally deferred as backlog items for the next phase.

The operating rule is:

build the promotion pipeline first
then close the security gaps in a controlled sequence

That means this document distinguishes between:

implemented now
required next
backlog for production hardening

Minimal RMF / NIST framing

This project is not implementing a full formal compliance program yet, but the design should still be legible through a minimal risk-management lens.

For this document, use two reference frames:

NIST AI RMF 1.0
- Govern
- Map
- Measure
- Manage
NIST RMF / NIST SP 800-53 style controls
- used here as a lightweight control backlog and architecture checklist

This is not a formal accreditation package. It is a practical mapping so the project can evolve into a more defensible financial-model operating pattern later.

Minimal NIST AI RMF mapping for this project

Govern

What it means here:

define who can train, validate, approve, and promote models
define what evidence is required before promotion
define what logs and audit records must exist

Current alignment:

candidate / validation / approval / promotion states now exist
promotion events are recorded

Backlog:

formal role separation
policy ownership
exception process
model-risk signoff

Map

What it means here:

identify the model purpose
identify the data sources
identify the risks of bad outputs
identify the environments where the model is allowed to operate

Current alignment:

model scope is explicit
source data is known
dev/test/prod logical stages are defined

Backlog:

explicit model risk classification
documented misuse / abuse scenarios
downstream business impact analysis

Measure

What it means here:

evaluate model quality
evaluate operational correctness
evaluate security and abuse exposure

Current alignment:

training metrics are recorded
validation status exists
artifact and promotion paths are explicit

Backlog:

formal validation thresholds
reproducibility checks
adversarial and red-team testing
drift and anomaly measurement

Manage

What it means here:

control promotion
control rollback
monitor active versions
respond to failures or misuse

Current alignment:

production can be pinned to a deployed model version
promotion records exist

Backlog:

rollback runbook enforcement
alerting on model-version drift
incident response integration

Minimal NIST RMF lifecycle mapping

This project can be read through a simplified RMF lifecycle:

Categorize
- classify the model and data as financially sensitive operational assets
Select
- choose minimal controls appropriate for a pilot with future production intent
Implement
- add Key Vault, managed identity, durable artifacts, approval states, and deployment mapping
Assess
- validate pipeline behavior, security assumptions, and promotion controls
Authorize
- explicitly approve model movement to test and prod
Monitor
- add logging, drift monitoring, security monitoring, and periodic access review

This project is currently between Implement and early Assess.

Current State

Already implemented

Azure ML workspace, compute, and environment
Key Vault-backed secret resolution
Blob-backed durable model artifacts
staged AML jobs:
- extract
- features
- train
- score
- publish
PostgreSQL-backed training and inference metadata
standalone score path that can load a fixed registered training run
basic secret hygiene through Key Vault + managed identity

Missing before this design pass

explicit candidate model registration
explicit validation status
explicit approval status
explicit deployment environment mapping
a clean way for production scoring to use a fixed promoted model version

Design Principles

Keep the Azure bill low.
Keep the operator workflow simple.
Separate training from deployment.
Promote metadata first, not infrastructure first.
Use the same codebase and the same core Azure resources for now.
Delay heavy enterprise features until the promotion path is stable.
Keep security requirements attached to each lifecycle stage so they are not forgotten.
Prefer identity-based access over shared secrets wherever possible.
Default to least privilege, immutability, and auditability.

Least-Cost Azure Resource Strategy

For now, do not create separate Azure stacks for dev, test, and prod.

Use one low-cost shared platform and separate environments logically through metadata and process:

AML Workspace: one shared workspace
AML Compute: one small CPU cluster with autoscale-to-zero
Blob Storage: one shared pilot storage account with model/run prefixes
PostgreSQL Flexible Server: one small server for metadata and batch results
Key Vault: the workspace Key Vault

This is the least-cost approach because:

no duplicate workspaces
no duplicate compute clusters
no duplicate Postgres servers
no duplicate storage accounts

The tradeoff is that dev/test/prod are logical stages, not hard infrastructure isolation. That is acceptable for this stage of the project.

Security backlog for the low-cost shared-platform choice

Required later:

separate Azure subscriptions or management groups for higher-trust environments
environment-specific storage accounts and databases
private networking for AML, Blob, Postgres, and Key Vault
CMK-backed encryption where required by policy
stricter RBAC separation between developers, approvers, and operators

Accepted for now:

one shared Azure platform with logical environment separation
public network access where already required for speed of setup

This is acceptable for the current stage, but not the final target state for a financial system.

Minimal control references to use in this design

The following NIST-style control families are the most relevant minimal set for this project.

These are not all implemented today. They are the reference set for backlog planning.

Access control

AC-2 Account Management
AC-3 Access Enforcement
AC-6 Least Privilege
AC-17 Remote Access

Use here for:

AML workspace role separation
Key Vault access restriction
promotion-role separation

Audit and accountability

AU-2 Event Logging
AU-3 Content of Audit Records
AU-6 Audit Record Review, Analysis, and Reporting
AU-12 Audit Record Generation

Use here for:

promotion events
approval evidence
scoring and publication trail

Configuration and change management

CM-2 Baseline Configuration
CM-3 Configuration Change Control
CM-5 Access Restrictions for Change
CM-8 System Component Inventory

Use here for:

AML environment versioning
job YAML control
environment promotion discipline

Identification and authentication

IA-2 Identification and Authentication (Organizational Users)
IA-5 Authenticator Management

Use here for:

operator identities
secret rotation
managed identity usage

System and communications protection

SC-7 Boundary Protection
SC-8 Transmission Confidentiality and Integrity
SC-12 Cryptographic Key Establishment and Management
SC-13 Cryptographic Protection
SC-28 Protection of Information at Rest

Use here for:

private endpoints
TLS enforcement
encryption at rest
Key Vault-backed secret storage

System and information integrity

SI-3 Malicious Code Protection
SI-4 System Monitoring
SI-7 Software, Firmware, and Information Integrity
SI-10 Information Input Validation

Use here for:

artifact integrity
validation of job inputs and model metadata
monitoring for unauthorized drift

Risk assessment

RA-3 Risk Assessment
RA-5 Vulnerability Monitoring and Scanning

Use here for:

threat modeling
dependency scanning
periodic review of AML/storage/Postgres exposure

Supply chain / provenance

SR-3 Supply Chain Controls and Processes
SR-11 Component Authenticity

Use here for:

image provenance
package provenance
signed artifact direction

Contingency / resilience

CP-2 Contingency Plan
CP-9 System Backup
CP-10 System Recovery and Reconstitution

Use here for:

rollback plan
recovery from bad model promotion
durable artifact recovery

Minimal Promotion Architecture

Training system of record

ndx_az_elasticnet.training_runs

This stores:

model name
feature set
horizon
training metrics
parameters
durable model_uri

Security expectations for this layer:

all training records should be append-only from an operator perspective
mutable status changes should be auditable
no secrets should ever be stored in metrics or parameter JSON
model URIs should point only to approved internal storage locations
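The last expectation above can be sketched as a small allowlist check. This is only an illustration: the storage host, the container prefix, and the function name is_approved_model_uri are placeholders invented for this example, not values from the project.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: the host and path prefix below are placeholders,
# not the real pilot storage account.
APPROVED_LOCATIONS = [
    ("pilotstorage.blob.core.windows.net", "/models/"),
]


def is_approved_model_uri(model_uri: str) -> bool:
    """Return True only for https URIs under an approved host and path prefix."""
    parsed = urlparse(model_uri)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    # Reject literal ".." segments so a prefix match cannot be escaped.
    if ".." in parsed.path.split("/"):
        return False
    return any(
        host == allowed_host and parsed.path.startswith(prefix)
        for allowed_host, prefix in APPROVED_LOCATIONS
    )
```

A check like this could run both when a training run is written and again when a candidate is registered, so that a manipulated URI is caught at two points.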

Model registry

ndx_az_elasticnet.model_versions

This is the promotion-facing registry layer on top of training_runs.

Each row represents one registered candidate model version tied to exactly one successful training run.

It stores:

model_version_id
training_run_id
model_uri
metrics and parameters
source commit
training data window
candidate_status
validation_status
approval_status

Security expectations for this layer:

promotion status changes must be attributable to a principal or workflow
approval should eventually require a distinct role from the trainer role
registry records should be tamper-evident through audit logs
model metadata should include source commit and training-data window for traceability
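To make the registry row concrete, here is a minimal in-memory sketch of one model_versions record and a traceability check. The field names follow the list above where possible; the split of the training data window into start and end dates, the types, and the helper missing_traceability_fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    """One registered candidate, tied to exactly one training run."""
    model_version_id: str
    training_run_id: str
    model_uri: str
    source_commit: Optional[str]
    # Assumption: the training data window is stored as a start and end date.
    training_window_start: Optional[date]
    training_window_end: Optional[date]
    metrics: dict = field(default_factory=dict)
    parameters: dict = field(default_factory=dict)
    candidate_status: str = "candidate"
    validation_status: str = "pending"
    approval_status: str = "pending"


def missing_traceability_fields(version: ModelVersion) -> list[str]:
    """List the traceability fields that are empty on this record."""
    required = ("source_commit", "training_window_start", "training_window_end")
    return [name for name in required if not getattr(version, name)]
```

A registration or validation step could refuse to proceed whenever this helper returns a non-empty list.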

Deployment map

ndx_az_elasticnet.model_deployments

This maps each logical environment to exactly one active model version:

dev
test
prod

This lets production scoring use a fixed approved version instead of “latest successful training run”.

Security expectations for this layer:

production deployment mapping should only be changeable by a restricted release role
environment aliases should not silently advance to a new model without an explicit promotion event
prod scoring should refuse unapproved models
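The lookup and the prod refusal rule can be sketched as follows. Plain dicts stand in for the model_deployments and model_versions tables, and the literal status value "approved" is an assumption, since this document does not state the stored value.

```python
ENVIRONMENTS = ("dev", "test", "prod")


class UnapprovedModelError(RuntimeError):
    """Raised when prod resolves to a model version that is not approved."""


def resolve_model_for_environment(environment, deployments, model_versions):
    """Return the model version record mapped to the given environment.

    deployments: dict of environment -> model_version_id (stand-in table).
    model_versions: dict of model_version_id -> record (stand-in table).
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    model_version_id = deployments.get(environment)
    if model_version_id is None:
        raise LookupError(f"no model version is deployed to {environment}")
    version = model_versions[model_version_id]
    # Assumption: an approved model carries approval_status == "approved".
    if environment == "prod" and version.get("approval_status") != "approved":
        raise UnapprovedModelError(
            f"{model_version_id} is mapped to prod but is not approved"
        )
    return version
```

Checking approval again at scoring time, not only at promotion time, means a mapping row edited outside the promotion script still cannot put an unapproved model into prod scoring.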

Promotion audit trail

ndx_az_elasticnet.model_promotion_events

This records:

candidate registration
validation decisions
approval decisions
environment promotions

Security expectations for this layer:

every state transition must be recorded
operator notes should be mandatory in stricter environments
audit records should be exported to a longer-retention sink later
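One possible way to make the event trail tamper-evident is a hash chain, where each event stores the hash of the previous one. This is a technique suggested here for illustration, not something the current tables are known to implement; the field names and the functions append_event and verify_chain are hypothetical, and a real record would also carry a timestamp.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first event


def _hash_body(body: dict) -> str:
    """Hash the event body in a stable key order."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(events, event_type, model_version_id, actor, note=""):
    """Append one state-transition event, chained to the previous event."""
    body = {
        "event_type": event_type,
        "model_version_id": model_version_id,
        "actor": actor,
        "note": note,
        "prev_hash": events[-1]["event_hash"] if events else GENESIS_HASH,
    }
    event = dict(body, event_hash=_hash_body(body))
    events.append(event)
    return event


def verify_chain(events) -> bool:
    """Return False if any event was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for event in events:
        body = {k: v for k, v in event.items() if k != "event_hash"}
        if body["prev_hash"] != prev_hash:
            return False
        if _hash_body(body) != event["event_hash"]:
            return False
        prev_hash = event["event_hash"]
    return True
```

A chain like this only detects tampering. It does not prevent it, so exporting the events to a separate longer-retention sink remains necessary.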

Lifecycle

1. Develop in dev

feature code changes
training code changes
scoring code changes
AML packaging changes

This is standard code development.

Security backlog for development:

mandatory branch protection
signed commits or equivalent commit provenance
secret scanning in PRs
dependency scanning and CVE policy
SAST for Python and shell
IaC scanning for AML, storage, Key Vault, and Postgres configuration
coding standards for secure SQL, secure serialization, and input handling
red-team-inspired abuse cases for:
- secret leakage
- artifact tampering
- promotion bypass
- poisoned training data assumptions

Relevant references:

CM-3, CM-5
RA-5
SR-11

2. Merge after CI passes

CI should eventually include:

unit tests
linting
packaging validation
shell script validation

The current repo already supports the test and packaging checks needed to start.

Security backlog for CI:

SCA/dependency scanning
SBOM generation
container image vulnerability scanning
reproducible build metadata
policy checks that block use of plaintext secrets
tests that production scoring only accepts promoted models
checks that AML job definitions do not take DB secrets as direct inputs
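The last check in the list above could start as a simple heuristic scan. The sketch below works on a job definition that has already been parsed into a dict (parsing the YAML itself is left out), assumes inputs sit under a top-level inputs key, and uses name and value patterns that are illustrative guesses, not a complete secret detector.

```python
import re

# Heuristic patterns only; both lists are assumptions for illustration.
SECRET_NAME_PATTERN = re.compile(
    r"(password|passwd|db_url|database_url|connection_string)", re.IGNORECASE
)
# Matches URLs of the form scheme://user:password@host...
CREDENTIAL_URL_PATTERN = re.compile(
    r"^[a-z][a-z0-9+.-]*://[^/\s:@]+:[^/\s@]+@", re.IGNORECASE
)


def find_secret_like_inputs(job_definition: dict) -> list[str]:
    """Return findings for job inputs that look like direct DB secrets."""
    findings = []
    inputs = job_definition.get("inputs") or {}
    for name, value in inputs.items():
        if SECRET_NAME_PATTERN.search(name):
            findings.append(f"input name looks secret-bearing: {name}")
        # Only literal string inputs are inspected in this sketch.
        if isinstance(value, str) and CREDENTIAL_URL_PATTERN.match(value):
            findings.append(f"input value embeds credentials: {name}")
    return findings
```

In CI, a non-empty result would fail the build and point the author toward resolving the secret through Key Vault inside the job.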

Relevant references:

AU-2, AU-12
CM-2
RA-5
SR-3

3. Train candidate model

Run the staged AML batch or the relevant train flow:

extract
features
train

Training writes:

durable model pickle to Blob
durable metrics JSON to Blob
training metadata to training_runs

Security backlog for training and data handling:

source data integrity validation
schema drift detection before feature generation
training-data lineage capture with hashes or dataset version IDs
access control that restricts who can trigger training in higher environments
outbound network restriction for AML jobs where practical
malware / artifact scanning for persisted model files
model serialization review to reduce unsafe pickle risk over time
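Lineage capture with hashes and later artifact integrity checks can both rest on one content-hash helper. The sketch below assumes SHA-256 as the digest; the function names are hypothetical. It hashes local files, so a Blob artifact would need to be downloaded first.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024) -> str:
    """Stream a file through SHA-256 so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches(path, expected_sha256: str) -> bool:
    """Compare a file's current hash with the hash recorded at training time."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Recording the hash of each input dataset and of the persisted model file alongside the training run would let a later step detect a swapped or corrupted artifact before a pickle is ever loaded.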

Financial-data controls required later:

confirm retention policy for copied market data
document permitted source datasets
document whether any licensed data has redistribution restrictions
ensure training artifacts never embed raw credentials or private operational data

Relevant references:

SC-8, SC-28
SI-7, SI-10
RA-3

4. Register candidate model

Use:

scripts/register_candidate_model.py

This creates a row in model_versions from a successful training run.

The candidate starts as:

candidate_status = candidate
validation_status = pending
approval_status = pending

Security backlog for candidate registration:

only successful training runs should be registerable
candidate registration should capture the acting principal
candidate registration should eventually require policy checks on:
- metrics completeness
- source commit presence
- model URI location
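The registration step and the policy checks listed above might look roughly like this. It is a sketch, not the contents of scripts/register_candidate_model.py: the training-run field names, the status value "succeeded", the required metric names, and the URI prefix are all assumptions.

```python
import uuid

# Hypothetical policy values, chosen only for illustration.
REQUIRED_METRICS = ("rmse", "mae")
APPROVED_URI_PREFIX = "https://pilotstorage.blob.core.windows.net/models/"


def register_candidate(training_run: dict, actor: str) -> dict:
    """Build a model_versions row from a successful training run."""
    # Assumption: successful runs are marked with status == "succeeded".
    if training_run.get("status") != "succeeded":
        raise ValueError("only successful training runs can be registered")

    problems = []
    metrics = training_run.get("metrics") or {}
    for name in REQUIRED_METRICS:
        if name not in metrics:
            problems.append(f"missing metric: {name}")
    if not training_run.get("source_commit"):
        problems.append("missing source commit")
    if not str(training_run.get("model_uri", "")).startswith(APPROVED_URI_PREFIX):
        problems.append("model URI is outside approved storage")
    if problems:
        raise ValueError("; ".join(problems))

    return {
        "model_version_id": str(uuid.uuid4()),
        "training_run_id": training_run["training_run_id"],
        "model_uri": training_run["model_uri"],
        "source_commit": training_run["source_commit"],
        "candidate_status": "candidate",
        "validation_status": "pending",
        "approval_status": "pending",
        "registered_by": actor,  # the acting principal
    }
```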

Relevant references:

AU-3, AU-12
CM-3

5. Run validation pipeline

For now, validation is a controlled operator step backed by metadata.

Use:

scripts/validate_candidate_model.py

This marks the model version's validation_status as passed or failed.

Later this can be replaced by a real validation job that checks:

metric thresholds
schema compatibility
reproducibility
inference smoke tests
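As a first step toward that validation job, the metric-threshold check could be expressed as data plus one function. The metric names and limits below are invented placeholders, not the project's validation criteria.

```python
# Hypothetical thresholds: ("max", x) means the metric must be <= x,
# ("min", x) means the metric must be >= x.
THRESHOLDS = {
    "rmse": ("max", 0.05),
    "r2": ("min", 0.10),
}


def evaluate_thresholds(metrics: dict, thresholds: dict = THRESHOLDS):
    """Return ("passed" or "failed", list of failure reasons)."""
    failures = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        # "not (a <= b)" is used so that NaN values count as failures.
        elif kind == "max" and not (value <= limit):
            failures.append(f"{name}: {value} exceeds {limit}")
        elif kind == "min" and not (value >= limit):
            failures.append(f"{name}: {value} is below {limit}")
    status = "passed" if not failures else "failed"
    return status, failures
```

Keeping the thresholds in a reviewed data structure, separate from the function, is one way to move the decision from operator judgment toward policy.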

Security backlog for validation:

enforce validation thresholds by policy rather than operator judgment alone
add adversarial / abuse-oriented validation:
- malformed artifact path
- missing or tampered blob artifact
- stale or mismatched model metadata
- unexpected schema changes
add red-team scenarios:
- can a non-approved model be injected into test or prod
- can a manipulated model URI point outside approved storage
- can recommendation publishing be triggered from an unvalidated run

Relevant references:

SI-7
RA-3
AU-6

6. Promote to test

Use:

scripts/promote_model.py --target-environment test

Only validated candidates should move to test.

Test is where you perform:

integration checks
shadow/batch checks
production-like scoring verification

Security backlog for test promotion:

require a release principal distinct from the training principal
verify Key Vault access still follows least privilege
verify environment-scoped scoring uses the deployment mapping, not the latest-run shortcut
verify logs and audit trail are complete before allowing prod promotion

Relevant references:

AC-3, AC-6
AU-2, AU-6
CM-5

7. Approve model

Use:

scripts/approve_model.py

This is the explicit gate between a technically validated model and an organizationally approved one.

Security backlog for approval:

implement maker-checker separation
require human approval in prod path
integrate approval evidence with ticket/change-management reference
require attestation that validation artifacts were reviewed
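Maker-checker separation and the change-management link can be sketched as preconditions on the approval step. This is not the behavior of scripts/approve_model.py today; the registered_by field, the change_ticket argument, and the status value "approved" are assumptions for illustration.

```python
def approve_model(version: dict, approver: str, change_ticket: str) -> dict:
    """Return an approved copy of the version, or raise if a rule is broken."""
    if version.get("validation_status") != "passed":
        raise PermissionError("only validated models can be approved")

    # Maker-checker: the approver must differ from whoever registered it.
    maker = str(version.get("registered_by", "")).strip().lower()
    checker = approver.strip().lower()
    if not checker:
        raise ValueError("approver is required")
    if checker == maker:
        raise PermissionError("approver must differ from the registering principal")

    if not change_ticket.strip():
        raise ValueError("a change-management reference is required")

    # Return a new dict so the caller's record is not mutated.
    return {
        **version,
        "approval_status": "approved",
        "approved_by": approver,
        "change_ticket": change_ticket,
    }
```

Comparing principals in application code is only a stopgap. The stronger control is distinct roles enforced by RBAC.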

Relevant references:

AC-2, AC-6
AU-3
CM-3

8. Promote to prod

Use:

scripts/promote_model.py --target-environment prod

Only approved models should reach prod.
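Together with the rule that only validated candidates move to test, the promotion gate can be sketched as one precondition function. The status literals "passed" and "approved" are assumptions, as is the choice that prod requires both validation and approval.

```python
ENVIRONMENTS = ("dev", "test", "prod")


def check_promotion_allowed(version: dict, target_environment: str) -> None:
    """Raise if the model version may not be promoted to the target."""
    if target_environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {target_environment!r}")

    validated = version.get("validation_status") == "passed"
    approved = version.get("approval_status") == "approved"

    # test and prod both require a passed validation.
    if target_environment in ("test", "prod") and not validated:
        raise PermissionError("model version has not passed validation")
    # prod additionally requires explicit approval.
    if target_environment == "prod" and not approved:
        raise PermissionError("model version is not approved for prod")
```

A promotion script could call this before writing the deployment mapping and the promotion event, so a refused promotion leaves no partial state.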

Security backlog for prod promotion:

require stronger RBAC and ideally PIM/JIT elevation
require immutable deployment event records
require rollback readiness before cutover
require release checklist covering:
- model version
- validation evidence
- approval evidence
- rollback target

Relevant references:

CP-2, CP-10
CM-3
AU-12

9. Production scoring uses promoted version

Production scoring should resolve the model by deployment environment:

--deployment-environment prod

That keeps production inference fixed to the approved version.

Security backlog for production scoring:

run with least-privilege identity
prevent write access to registry tables from the scoring identity
restrict production scoring from training-related secrets
produce immutable inference logs and recommendation publication logs
monitor for unexpected model-version drift
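Monitoring for unexpected model-version drift can start as a comparison between what scoring actually used and what the deployment map says. In this sketch, the inference record fields environment and model_version_id are assumed names, and plain lists and dicts stand in for the inference metadata and model_deployments tables.

```python
from collections import Counter


def find_version_drift(inference_records, deployments, environment="prod"):
    """Count inference records that used a version other than the mapped one.

    Returns a dict of unexpected model_version_id -> number of records.
    An empty dict means no drift was observed.
    """
    expected = deployments[environment]
    unexpected = Counter(
        record["model_version_id"]
        for record in inference_records
        if record["environment"] == environment
        and record["model_version_id"] != expected
    )
    return dict(unexpected)
```

A scheduled check could alert whenever the result is non-empty, which would also catch scoring runs that bypassed the deployment mapping.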

Relevant references:

AC-6
SC-7, SC-8
SI-4, SI-7

Scripts Added

Candidate registration

mmaindx_az_elasticnet/scripts/register_candidate_model.py

Validation decision

mmaindx_az_elasticnet/scripts/validate_candidate_model.py

Approval decision

mmaindx_az_elasticnet/scripts/approve_model.py

Promotion

mmaindx_az_elasticnet/scripts/promote_model.py

Operator Commands

Register latest successful training run as candidate

make register-candidate

Validate a candidate

make validate-candidate MODEL_VERSION_ID=<uuid>

Promote to test

make promote-model MODEL_VERSION_ID=<uuid> DEPLOYMENT_ENV=test

Approve a validated candidate

make approve-model MODEL_VERSION_ID=<uuid>

Promote to prod

make promote-model MODEL_VERSION_ID=<uuid> DEPLOYMENT_ENV=prod

Score by promoted environment

uv run python mmaindx_az_elasticnet/scripts/score_latest.py \
  --as-of-date 2026-06-05 \
  --deployment-environment prod

Security Backlog by Domain

Identity and access

separate roles for developer, trainer, validator, approver, and prod operator
prefer managed identities over shared credentials everywhere
move toward JIT/PIM for prod-changing actions

Secrets management

eliminate remaining direct DB URLs from local operator flows where practical
rotate source and target DB secrets regularly
add secret expiry/rotation policy and ownership

Network security

move AML, Blob, Postgres, and Key Vault to private endpoints
restrict public ingress and egress
add firewall allowlists and DNS design

Supply chain security

signed images
SBOMs
image provenance
dependency pinning with vulnerability policy

Data security

data classification for market, operational, and derived model data
retention and deletion policy
encryption at rest and in transit review
artifact integrity verification

Logging and monitoring

centralized audit logging
promotion-event monitoring
failed-auth and privilege-escalation monitoring
anomaly detection on unexpected production model changes

Red teaming and resilience

Backlog red-team exercises should include:

stealing or replaying model artifacts
tampering with deployment mappings
bypassing approval to reach prod
poisoning copied training data
abusing AML job definitions to exfiltrate secrets
publishing recommendations from an unapproved model

Compliance and governance

change-management linkage for prod promotion
evidence retention for approvals and validation
documented operating procedures for rollback
periodic access review

Security priorities by phase

Acceptable for current pilot

shared Azure platform with logical environment separation
managed identity + Key Vault for core secrets
durable model artifacts
audit trail for promotion metadata

Required next

branch protection and CI scanning
trainer / validator / approver role separation
formal validation criteria
production scoring only through promoted environment mapping
secret rotation plan

Primary references:

AC-6
AU-12
CM-3
RA-5

Required before real production use

private networking for AML / Blob / Postgres / Key Vault
stronger RBAC and JIT/PIM for promotion roles
centralized audit logging and alerting
immutable release evidence and rollback procedure
supply-chain controls for images and dependencies
formal red-team / adversarial testing
periodic access review and governance workflow

Primary references:

AC-3
AC-6
AU-2
AU-6
SC-7
SC-28
SI-4
SI-7
RA-3
RA-5
SR-3
CP-10

Mapping: What Exists vs What Is Missing

Exists now

durable model storage
training metadata
inference metadata
candidate registration
validation status
approval status
environment deployment mapping
environment-based scoring
basic security foundation through Key Vault + managed identity

Still missing

real automated validation policy engine
AML-native pipeline object
separate infrastructure per environment
formal model registry service
production monitoring and alerting
policy-as-code for approval controls
maker-checker approval enforcement
private networking and environment isolation
centralized security logging and alerting
red-team validation and adversarial testing
supply-chain security controls

Recommended Next Phase

Do not jump to a heavy enterprise platform yet.

The next sensible steps are:

Run one full staged AML batch with the new promotion model.
Register the resulting training run as a candidate.
Validate and promote it to test.
Run score in test.
Approve and promote to prod.
Make production scoring use --deployment-environment prod.

After that, add:

AML pipeline-job orchestration
validation job automation
environment-specific configuration overlays
stronger audit/security controls
threat modeling workshop
red-team backlog execution
private network migration
least-privilege role split for promotion operations

Why this is the right level of architecture now

This design gives you:

versioned trained models
explicit promotion states
explicit deployment targets
low Azure cost
clear operator workflow

without forcing:

multiple workspaces
multiple databases
a full external registry service
a large MLOps platform upfront

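This is a reasonable minimum viable promotion pattern for the current project stage. The items listed under "Required before real production use" still apply before it carries real production workloads.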