ElasticNet Dev-Test-Prod Promotion Design
Purpose
This document maps the current Azure ElasticNet pilot into a simple model-promotion pipeline that is credible for enterprise-style delivery without overbuilding the first version.
The goal is not to build a full MLOps platform today. The goal is to establish:
- a clear candidate model lifecycle
- a low-cost dev/test/prod promotion pattern
- durable artifacts
- explicit approval and promotion steps
- a structure that can absorb stronger security, observability, and automation later
Security posture for this document
This design now lists the security controls that a financial application should have, but many of them are intentionally deferred as backlog items for the next phase.
The operating rule is:
- build the promotion pipeline first
- then close the security gaps in a controlled sequence
That means this document distinguishes between:
- implemented now
- required next
- backlog for production hardening
Minimal RMF / NIST framing
This project is not implementing a full formal compliance program yet, but the design should still be legible through a minimal risk-management lens.
For this document, use two reference frames:
- NIST AI RMF 1.0
  - Govern
  - Map
  - Measure
  - Manage
- NIST RMF / NIST SP 800-53 style controls
  - used here as a lightweight control backlog and architecture checklist
This is not a formal accreditation package. It is a practical mapping so the project can evolve into a more defensible financial-model operating pattern later.
Minimal NIST AI RMF mapping for this project
Govern
What it means here:
- define who can train, validate, approve, and promote models
- define what evidence is required before promotion
- define what logs and audit records must exist
Current alignment:
- candidate / validation / approval / promotion states now exist
- promotion events are recorded
Backlog:
- formal role separation
- policy ownership
- exception process
- model-risk signoff
Map
What it means here:
- identify the model purpose
- identify the data sources
- identify the risks of bad outputs
- identify the environments where the model is allowed to operate
Current alignment:
- model scope is explicit
- source data is known
- dev/test/prod logical stages are defined
Backlog:
- explicit model risk classification
- documented misuse / abuse scenarios
- downstream business impact analysis
Measure
What it means here:
- evaluate model quality
- evaluate operational correctness
- evaluate security and abuse exposure
Current alignment:
- training metrics are recorded
- validation status exists
- artifact and promotion paths are explicit
Backlog:
- formal validation thresholds
- reproducibility checks
- adversarial and red-team testing
- drift and anomaly measurement
Manage
What it means here:
- control promotion
- control rollback
- monitor active versions
- respond to failures or misuse
Current alignment:
- production can be pinned to a deployed model version
- promotion records exist
Backlog:
- rollback runbook enforcement
- alerting on model-version drift
- incident response integration
Minimal NIST RMF lifecycle mapping
This project can be read through a simplified RMF lifecycle:
- Categorize
  - classify the model and data as financially sensitive operational assets
- Select
  - choose minimal controls appropriate for a pilot with future production intent
- Implement
  - add Key Vault, managed identity, durable artifacts, approval states, and deployment mapping
- Assess
  - validate pipeline behavior, security assumptions, and promotion controls
- Authorize
  - explicitly approve model movement to test and prod
- Monitor
  - add logging, drift monitoring, security monitoring, and periodic access review
This project is currently between Implement and early Assess.
Current State
Already implemented
- Azure ML workspace, compute, and environment
- Key Vault-backed secret resolution
- Blob-backed durable model artifacts
- staged AML jobs: extract, features, train, score, publish
- PostgreSQL-backed training and inference metadata
- standalone score path that can load a fixed registered training run
- basic secret hygiene through Key Vault + managed identity
Missing before this design pass
- explicit candidate model registration
- explicit validation status
- explicit approval status
- explicit deployment environment mapping
- a clean way for production scoring to use a fixed promoted model version
Design Principles
- Keep the Azure bill low.
- Keep the operator workflow simple.
- Separate training from deployment.
- Promote metadata first, not infrastructure first.
- Use the same codebase and the same core Azure resources for now.
- Delay heavy enterprise features until the promotion path is stable.
- Keep security requirements attached to each lifecycle stage so they are not forgotten.
- Prefer identity-based access over shared secrets wherever possible.
- Default to least privilege, immutability, and auditability.
Least-Cost Azure Resource Strategy
For now, do not create separate Azure stacks for dev, test, and prod.
Use one low-cost shared platform and separate environments logically through metadata and process:
- AML Workspace: one shared workspace
- AML Compute: one small CPU cluster with autoscale-to-zero
- Blob Storage: one shared pilot storage account with model/run prefixes
- PostgreSQL Flexible Server: one small server for metadata and batch results
- Key Vault: the workspace Key Vault
This is the least-cost approach because:
- no duplicate workspaces
- no duplicate compute clusters
- no duplicate Postgres servers
- no duplicate storage accounts
The tradeoff is that dev/test/prod are logical stages, not hard infrastructure isolation. That is acceptable for this stage of the project.
Security backlog for the low-cost shared-platform choice
Required later:
- separate Azure subscriptions or management groups for higher-trust environments
- environment-specific storage accounts and databases
- private networking for AML, Blob, Postgres, and Key Vault
- CMK-backed encryption where required by policy
- stricter RBAC separation between developers, approvers, and operators
Accepted for now:
- one shared Azure platform with logical environment separation
- public network access where already required for speed of setup
This is acceptable for the current stage, but not the final target state for a financial system.
Minimal control references to use in this design
The following NIST-style control families are the most relevant minimal set for this project.
These are not all implemented today. They are the reference set for backlog planning.
Access control
- AC-2 Account Management
- AC-3 Access Enforcement
- AC-6 Least Privilege
- AC-17 Remote Access
Use here for:
- AML workspace role separation
- Key Vault access restriction
- promotion-role separation
Audit and accountability
- AU-2 Event Logging
- AU-3 Content of Audit Records
- AU-6 Audit Record Review, Analysis, and Reporting
- AU-12 Audit Record Generation
Use here for:
- promotion events
- approval evidence
- scoring and publication trail
Configuration and change management
- CM-2 Baseline Configuration
- CM-3 Configuration Change Control
- CM-5 Access Restrictions for Change
- CM-8 System Component Inventory
Use here for:
- AML environment versioning
- job YAML control
- environment promotion discipline
Identification and authentication
- IA-2 Identification and Authentication (Organizational Users)
- IA-5 Authenticator Management
Use here for:
- operator identities
- secret rotation
- managed identity usage
System and communications protection
- SC-7 Boundary Protection
- SC-8 Transmission Confidentiality and Integrity
- SC-12 Cryptographic Key Establishment and Management
- SC-13 Cryptographic Protection
- SC-28 Protection of Information at Rest
Use here for:
- private endpoints
- TLS enforcement
- encryption at rest
- Key Vault-backed secret storage
System and information integrity
- SI-3 Malicious Code Protection
- SI-4 System Monitoring
- SI-7 Software, Firmware, and Information Integrity
- SI-10 Information Input Validation
Use here for:
- artifact integrity
- validation of job inputs and model metadata
- monitoring for unauthorized drift
Risk assessment
- RA-3 Risk Assessment
- RA-5 Vulnerability Monitoring and Scanning
Use here for:
- threat modeling
- dependency scanning
- periodic review of AML/storage/Postgres exposure
Supply chain / provenance
- SR-3 Supply Chain Controls and Processes
- SR-11 Component Authenticity
Use here for:
- image provenance
- package provenance
- signed artifact direction
Contingency / resilience
- CP-2 Contingency Plan
- CP-9 System Backup
- CP-10 System Recovery and Reconstitution
Use here for:
- rollback plan
- recovery from bad model promotion
- durable artifact recovery
Minimal Promotion Architecture
Training system of record
ndx_az_elasticnet.training_runs
This stores:
- model name
- feature set
- horizon
- training metrics
- parameters
- durable model_uri
Security expectations for this layer:
- all training records should be append-only from an operator perspective
- mutable status changes should be auditable
- no secrets should ever be stored in metrics or parameter JSON
- model URIs should point only to approved internal storage locations
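The last expectation above could be enforced with a small check wherever a model_uri is written. The following is a minimal sketch only: the function name is_approved_model_uri, the APPROVED_PREFIXES allowlist, and the example storage account and container names are all assumptions, not values taken from this project.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of (host, path prefix) pairs. The real storage
# account and container names are not given in this document.
APPROVED_PREFIXES = (
    ("examplepilotstorage.blob.core.windows.net", "/models/"),
)


def is_approved_model_uri(model_uri: str) -> bool:
    """Return True only for https URIs under an approved host and path prefix."""
    parsed = urlparse(model_uri)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    # Reject path traversal segments outright instead of normalizing them.
    if ".." in parsed.path.split("/"):
        return False
    return any(
        host == allowed_host and parsed.path.startswith(prefix)
        for allowed_host, prefix in APPROVED_PREFIXES
    )
```

Matching the exact hostname, rather than a substring of the URI, is the important design choice here: it rejects lookalike hosts that merely contain the approved account name.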
Model registry
ndx_az_elasticnet.model_versions
This is the promotion-facing registry layer on top of training_runs.
Each row represents one registered candidate model version tied to exactly one successful training run.
It stores:
- model_version_id
- training_run_id
- model_uri
- metrics and parameters
- source commit
- training data window
- candidate_status
- validation_status
- approval_status
Security expectations for this layer:
- promotion status changes must be attributable to a principal or workflow
- approval should eventually require a distinct role from the trainer role
- registry records should be tamper-evident through audit logs
- model metadata should include source commit and training-data window for traceability
Deployment map
ndx_az_elasticnet.model_deployments
This maps one active model version to one logical environment:
- dev
- test
- prod
This lets production scoring use a fixed approved version instead of “latest successful training run”.
Security expectations for this layer:
- production deployment mapping should only be changeable by a restricted release role
- environment aliases should not silently advance to a new model without an explicit promotion event
- prod scoring should refuse unapproved models
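The lookup this layer enables can be sketched in memory as follows. This is illustrative, not the project's implementation: plain dicts stand in for the model_deployments and model_versions tables, the names ModelVersion, UnapprovedModelError, and resolve_deployed_model are hypothetical, and the status value "approved" is an assumed counterpart to the documented "pending".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    model_version_id: str
    model_uri: str
    validation_status: str  # "pending", "passed", or "failed"
    approval_status: str    # "pending" or "approved" ("approved" is assumed)


class UnapprovedModelError(RuntimeError):
    """Raised when prod would resolve to a model that has not cleared both gates."""


def resolve_deployed_model(environment, deployments, versions):
    """Return the single active model version mapped to an environment.

    `deployments` maps environment name -> model_version_id.
    `versions` maps model_version_id -> ModelVersion.
    """
    if environment not in ("dev", "test", "prod"):
        raise ValueError(f"unknown environment: {environment!r}")
    version_id = deployments.get(environment)
    if version_id is None:
        raise LookupError(f"no model version deployed to {environment}")
    version = versions[version_id]
    # Prod scoring refuses anything that is not both validated and approved.
    if environment == "prod" and (
        version.validation_status != "passed"
        or version.approval_status != "approved"
    ):
        raise UnapprovedModelError(version_id)
    return version
```

Checking approval again at resolution time, and not only at promotion time, means a mapping row that was edited out of band still cannot make prod score with an unapproved model.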
Promotion audit trail
ndx_az_elasticnet.model_promotion_events
This records:
- candidate registration
- validation decisions
- approval decisions
- environment promotions
Security expectations for this layer:
- every state transition must be recorded
- operator notes should be mandatory in stricter environments
- audit records should be exported to a longer-retention sink later
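One possible way to make such a trail tamper-evident is to chain each event to the hash of the one before it. This document does not prescribe hash chaining; the sketch below is one option under that assumption, and the function names, field names, and event type strings are hypothetical. Timestamps are omitted to keep the example deterministic.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first event


def _digest(body):
    """Hash the canonical JSON form of an event body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log, event_type, model_version_id, principal, note=""):
    """Append an event whose hash covers its content and the previous hash."""
    body = {
        "event_type": event_type,
        "model_version_id": model_version_id,
        "principal": principal,
        "note": note,
        "prev_hash": log[-1]["event_hash"] if log else GENESIS_HASH,
    }
    log.append({**body, "event_hash": _digest(body)})
    return log[-1]


def verify_chain(log):
    """Return True if every event matches its hash and links to its predecessor."""
    prev_hash = GENESIS_HASH
    for event in log:
        body = {key: value for key, value in event.items() if key != "event_hash"}
        if event["prev_hash"] != prev_hash or event["event_hash"] != _digest(body):
            return False
        prev_hash = event["event_hash"]
    return True
```

A chain like this detects edits, reordering, and deletions from the middle of the log. It does not detect truncation of the most recent events, which is one reason exporting to a separate longer-retention sink remains on the backlog.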
Lifecycle
1. Develop in dev
- feature code changes
- training code changes
- scoring code changes
- AML packaging changes
This is standard code development.
Security backlog for development:
- mandatory branch protection
- signed commits or equivalent commit provenance
- secret scanning in PRs
- dependency scanning and CVE policy
- SAST for Python and shell
- IaC scanning for AML, storage, Key Vault, and Postgres configuration
- coding standards for secure SQL, secure serialization, and input handling
- red-team-inspired abuse cases for:
  - secret leakage
  - artifact tampering
  - promotion bypass
  - poisoned training data assumptions
Relevant references:
- CM-3, CM-5
- RA-5
- SR-11
2. Merge after CI passes
CI should eventually include:
- unit tests
- linting
- packaging validation
- shell script validation
The current repo already supports the test and packaging checks needed to start.
Security backlog for CI:
- SCA/dependency scanning
- SBOM generation
- container image vulnerability scanning
- reproducible build metadata
- policy checks that block use of plaintext secrets
- test that production scoring only accepts promoted models
- checks that AML job definitions do not take DB secrets as direct inputs
Relevant references:
- AU-2, AU-12
- CM-2
- RA-5
- SR-3
3. Train candidate model
Run the staged AML batch or the relevant train flow:
- extract
- features
- train
Training writes:
- durable model pickle to Blob
- durable metrics JSON to Blob
- training metadata to training_runs
Security backlog for training and data handling:
- source data integrity validation
- schema drift detection before feature generation
- training-data lineage capture with hashes or dataset version IDs
- access control that restricts who can trigger training in higher environments
- outbound network restriction for AML jobs where practical
- malware / artifact scanning for persisted model files
- model serialization review to reduce unsafe pickle risk over time
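Lineage capture with hashes and later integrity checks on persisted model files can share one small helper. The sketch below assumes a SHA-256 digest is recorded alongside the training run; the function names are hypothetical and no such column is confirmed by this document.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Compare a local artifact against the hex digest recorded at training time."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

Verifying the digest before loading a pickle helps catch an artifact that was swapped after training. It does not make pickle itself safe, so the serialization review above still applies.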
Financial-data controls required later:
- confirm retention policy for copied market data
- document permitted source datasets
- document whether any licensed data has redistribution restrictions
- ensure training artifacts never embed raw credentials or private operational data
Relevant references:
- SC-8, SC-28
- SI-7, SI-10
- RA-3
4. Register candidate model
Use:
scripts/register_candidate_model.py
This creates a row in model_versions from a successful training run.
The candidate starts as:
- candidate_status = candidate
- validation_status = pending
- approval_status = pending
Security backlog for candidate registration:
- only successful training runs should be registerable
- candidate registration should capture the acting principal
- candidate registration should eventually require policy checks on:
  - metrics completeness
  - source commit presence
  - model URI location
Relevant references:
- AU-3, AU-12
- CM-3
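The registration checks listed for this step could look roughly like the sketch below. It is not the contents of register_candidate_model.py: the function names, the training-run field names ("status", "metrics", "source_commit"), and the "succeeded" status value are assumptions made for illustration.

```python
def registration_problems(run, required_metrics, approved_uri_prefix):
    """Return a list of reasons a training run cannot be registered (empty if none)."""
    problems = []
    if run.get("status") != "succeeded":
        problems.append("training run did not succeed")
    metrics = run.get("metrics") or {}
    missing = [name for name in required_metrics if name not in metrics]
    if missing:
        problems.append("missing metrics: " + ", ".join(missing))
    if not run.get("source_commit"):
        problems.append("source commit is missing")
    if not str(run.get("model_uri") or "").startswith(approved_uri_prefix):
        problems.append("model_uri is outside approved storage")
    return problems


def build_candidate_row(run, registered_by):
    """Build the initial model_versions row with the documented starting states."""
    return {
        "training_run_id": run["training_run_id"],
        "model_uri": run["model_uri"],
        "registered_by": registered_by,  # acting principal, hypothetical column
        "candidate_status": "candidate",
        "validation_status": "pending",
        "approval_status": "pending",
    }
```

Returning every problem at once, instead of failing on the first, gives the operator a complete list to fix before retrying.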
5. Run validation pipeline
For now, validation is a controlled operator step backed by metadata.
Use:
scripts/validate_candidate_model.py
This marks the model version as passed or failed.
Later this can be replaced by a real validation job that checks:
- metric thresholds
- schema compatibility
- reproducibility
- inference smoke tests
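As an illustration of the first item, a metric-threshold check might take the recorded training metrics and a threshold table and return one of the two documented outcomes. The function name, the threshold format, and the metric names in the usage note are hypothetical; this document does not define thresholds yet.

```python
import math


def validate_metrics(metrics, thresholds):
    """Return ("passed" or "failed", reasons) for metrics against thresholds.

    `thresholds` maps metric name -> (kind, limit), where kind is "max"
    (value must not exceed limit) or "min" (value must not fall below it).
    """
    reasons = []
    for name, (kind, limit) in thresholds.items():
        if kind not in ("min", "max"):
            raise ValueError(f"unknown threshold kind for {name}: {kind!r}")
        value = metrics.get(name)
        # Treat absent and NaN values as failures instead of letting them pass.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            reasons.append(f"{name}: missing or not a number")
        elif kind == "max" and value > limit:
            reasons.append(f"{name}: {value} exceeds maximum {limit}")
        elif kind == "min" and value < limit:
            reasons.append(f"{name}: {value} is below minimum {limit}")
    return ("failed" if reasons else "passed"), reasons
```

For example, validate_metrics({"rmse": 0.04}, {"rmse": ("max", 0.05)}) yields a "passed" status with no reasons, while a missing metric fails closed.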
Security backlog for validation:
- enforce validation thresholds by policy rather than operator judgment alone
- add adversarial / abuse-oriented validation:
  - malformed artifact path
  - missing or tampered blob artifact
  - stale or mismatched model metadata
  - unexpected schema changes
- add red-team scenarios:
  - can a non-approved model be injected into test or prod
  - can a manipulated model URI point outside approved storage
  - can recommendation publishing be triggered from an unvalidated run
Relevant references:
- SI-7
- RA-3
- AU-6
6. Promote to test
Use:
scripts/promote_model.py --target-environment test
Only validated candidates should move to test.
Test is where you perform:
- integration checks
- shadow/batch checks
- production-like scoring verification
Security backlog for test promotion:
- require a release principal distinct from the training principal
- verify Key Vault access still follows least privilege
- verify environment-scoped scoring uses the deployment mapping, not the latest run shortcut
- verify logs and audit trail are complete before allowing prod promotion
Relevant references:
- AC-3, AC-6
- AU-2, AU-6
- CM-5
7. Approve model
Use:
scripts/approve_model.py
This is the explicit gate between:
- technically validated
- and organizationally approved
Security backlog for approval:
- implement maker-checker separation
- require human approval in prod path
- integrate approval evidence with ticket/change-management reference
- require attestation that validation artifacts were reviewed
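Maker-checker separation and the change-management link can both be expressed as preconditions on the approval step. The sketch below is not the contents of approve_model.py: the names ApprovalRefused and approve_model, the fields "trained_by", "approved_by", and "change_ticket", and the "approved" status value are assumptions.

```python
class ApprovalRefused(Exception):
    """Raised when an approval request violates a gate rule."""


def approve_model(version, approver, change_ticket):
    """Return an approved copy of `version`, or raise ApprovalRefused."""
    if version.get("validation_status") != "passed":
        raise ApprovalRefused("only validated models can be approved")
    # Maker-checker: the approver must not be the principal who trained the model.
    if not approver or approver == version.get("trained_by"):
        raise ApprovalRefused("approver must differ from the training principal")
    if not change_ticket:
        raise ApprovalRefused("a change-management reference is required")
    return {
        **version,
        "approval_status": "approved",
        "approved_by": approver,
        "change_ticket": change_ticket,
    }
```

Returning a new record and leaving the input untouched keeps the pre-approval state available for the audit trail.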
Relevant references:
- AC-2, AC-6
- AU-3
- CM-3
8. Promote to prod
Use:
scripts/promote_model.py --target-environment prod
Only approved models should reach prod.
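The two gate rules, validated before test and approved before prod, can be captured in one table-driven check. This is a sketch under assumptions, not the contents of promote_model.py: the names are hypothetical, dicts stand in for the registry tables, "approved" is an assumed status value, and treating dev as ungated is also an assumption.

```python
class PromotionRefused(Exception):
    """Raised when a model version does not meet the target environment's gate."""


# Status each environment requires before a version may be mapped to it.
REQUIRED_STATUS = {
    "dev": {},
    "test": {"validation_status": "passed"},
    "prod": {"validation_status": "passed", "approval_status": "approved"},
}


def promote(version, target_environment, deployments, events, principal):
    """Map `version` to an environment and record the event.

    Returns the previously deployed model_version_id (the rollback target),
    or None if the environment had no deployment.
    """
    if target_environment not in REQUIRED_STATUS:
        raise ValueError(f"unknown environment: {target_environment!r}")
    for field, expected in REQUIRED_STATUS[target_environment].items():
        if version.get(field) != expected:
            raise PromotionRefused(
                f"{field} must be {expected!r} to promote to {target_environment}"
            )
    previous = deployments.get(target_environment)
    deployments[target_environment] = version["model_version_id"]
    events.append({
        "event_type": "promotion",
        "environment": target_environment,
        "model_version_id": version["model_version_id"],
        "previous_model_version_id": previous,
        "principal": principal,
    })
    return previous
```

Recording the previous version on each promotion event means the rollback target is always known at cutover time.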
Security backlog for prod promotion:
- require stronger RBAC and ideally PIM/JIT elevation
- require immutable deployment event records
- require rollback readiness before cutover
- require release checklist covering:
  - model version
  - validation evidence
  - approval evidence
  - rollback target
Relevant references:
- CP-2, CP-10
- CM-3
- AU-12
9. Production scoring uses promoted version
Production scoring should resolve the model by deployment environment:
--deployment-environment prod
That keeps production inference fixed to the approved version.
Security backlog for production scoring:
- run with least-privilege identity
- prevent write access to registry tables from the scoring identity
- restrict production scoring from training-related secrets
- produce immutable inference logs and recommendation publication logs
- monitor for unexpected model-version drift
Relevant references:
- AC-6
- SC-7, SC-8
- SI-4, SI-7
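Monitoring for unexpected model-version drift can start as a simple comparison between what scoring runs actually used and what the deployment map says they should have used. The sketch below assumes each scoring record carries a model_version_id field; that field and the function name are hypothetical.

```python
def find_version_drift(scoring_runs, deployed_version_id):
    """Return the scoring runs that used a model other than the deployed one."""
    return [
        run for run in scoring_runs
        if run.get("model_version_id") != deployed_version_id
    ]
```

A non-empty result is a signal to alert on. A run with no recorded version is also reported, since it cannot be shown to match the deployment map.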
Scripts Added
Candidate registration
mmaindx_az_elasticnet/scripts/register_candidate_model.py
Validation decision
mmaindx_az_elasticnet/scripts/validate_candidate_model.py
Approval decision
mmaindx_az_elasticnet/scripts/approve_model.py
Promotion
mmaindx_az_elasticnet/scripts/promote_model.py
Operator Commands
Register latest successful training run as candidate
make register-candidate
Validate a candidate
make validate-candidate MODEL_VERSION_ID=<uuid>
Approve a validated candidate
make approve-model MODEL_VERSION_ID=<uuid>
Promote to test
make promote-model MODEL_VERSION_ID=<uuid> DEPLOYMENT_ENV=test
Promote to prod
make promote-model MODEL_VERSION_ID=<uuid> DEPLOYMENT_ENV=prod
Score by promoted environment
uv run python mmaindx_az_elasticnet/scripts/score_latest.py \
--as-of-date 2026-06-05 \
--deployment-environment prod
Security Backlog by Domain
Identity and access
- separate roles for developer, trainer, validator, approver, and prod operator
- prefer managed identities over shared credentials everywhere
- move toward JIT/PIM for prod-changing actions
Secrets management
- eliminate remaining direct DB URLs from local operator flows where practical
- rotate source and target DB secrets regularly
- add secret expiry/rotation policy and ownership
Network security
- move AML, Blob, Postgres, and Key Vault to private endpoints
- restrict public ingress and egress
- add firewall allowlists and DNS design
Supply chain security
- signed images
- SBOMs
- image provenance
- dependency pinning with vulnerability policy
Data security
- data classification for market, operational, and derived model data
- retention and deletion policy
- encryption at rest and in transit review
- artifact integrity verification
Logging and monitoring
- centralized audit logging
- promotion-event monitoring
- failed-auth and privilege-escalation monitoring
- anomaly detection on unexpected production model changes
Red teaming and resilience
Backlog red-team exercises should include:
- stealing or replaying model artifacts
- tampering with deployment mappings
- bypassing approval to reach prod
- poisoning copied training data
- abusing AML job definitions to exfiltrate secrets
- publishing recommendations from an unapproved model
Compliance and governance
- change-management linkage for prod promotion
- evidence retention for approvals and validation
- documented operating procedures for rollback
- periodic access review
Security priorities by phase
Acceptable for current pilot
- shared Azure platform with logical environment separation
- managed identity + Key Vault for core secrets
- durable model artifacts
- audit trail for promotion metadata
Required next
- branch protection and CI scanning
- trainer / validator / approver role separation
- formal validation criteria
- production scoring only through promoted environment mapping
- secret rotation plan
Primary references:
- AC-6
- AU-12
- CM-3
- RA-5
Required before real production use
- private networking for AML / Blob / Postgres / Key Vault
- stronger RBAC and JIT/PIM for promotion roles
- centralized audit logging and alerting
- immutable release evidence and rollback procedure
- supply-chain controls for images and dependencies
- formal red-team / adversarial testing
- periodic access review and governance workflow
Primary references:
- AC-3
- AC-6
- AU-2
- AU-6
- SC-7
- SC-28
- SI-4
- SI-7
- RA-3
- RA-5
- SR-3
- CP-10
Mapping: What Exists vs What Is Missing
Exists now
- durable model storage
- training metadata
- inference metadata
- candidate registration
- validation status
- approval status
- environment deployment mapping
- environment-based scoring
- basic security foundation through Key Vault + managed identity
Still missing
- real automated validation policy engine
- AML-native pipeline object
- separate infrastructure per environment
- formal model registry service
- production monitoring and alerting
- policy-as-code for approval controls
- maker-checker approval enforcement
- private networking and environment isolation
- centralized security logging and alerting
- red-team validation and adversarial testing
- supply-chain security controls
Recommended Next Phase
Do not jump to a heavy enterprise platform yet.
The next sensible steps are:
- Run one full staged AML batch with the new promotion model.
- Register the resulting training run as a candidate.
- Validate and promote it to test.
- Run score in test.
- Approve and promote to prod.
- Make production scoring use --deployment-environment prod.
After that, add:
- AML pipeline-job orchestration
- validation job automation
- environment-specific configuration overlays
- stronger audit/security controls
- threat modeling workshop
- red-team backlog execution
- private network migration
- least-privilege role split for promotion operations
Why this is the right level of architecture now
This design gives you:
- versioned trained models
- explicit promotion states
- explicit deployment targets
- low Azure cost
- clear operator workflow
without forcing:
- multiple workspaces
- multiple databases
- a full external registry service
- a large MLOps platform upfront
This is the correct minimum viable production pattern for the current project stage.