ElasticNet Dev-Test-Prod Promotion Design

Dev-test-prod promotion pipeline design for Azure ML workloads: environment isolation, artifact promotion, gate criteria, and rollback strategy.

December 5, 2025·14 min read

Purpose

This document maps the current Azure ElasticNet pilot into a simple model-promotion pipeline that is credible for enterprise-style delivery without overbuilding the first version.

The goal is not to build a full MLOps platform today. The goal is to establish:

  • a clear candidate model lifecycle
  • a low-cost dev/test/prod promotion pattern
  • durable artifacts
  • explicit approval and promotion steps
  • a structure that can absorb stronger security, observability, and automation later

Security posture for this document

This design lists the security controls that a financial application should have, but many of them are deliberately deferred as backlog items for the next phase.

The operating rule is:

  • build the promotion pipeline first
  • then close the security gaps in a controlled sequence

That means this document distinguishes between:

  • implemented now
  • required next
  • backlog for production hardening

Minimal RMF / NIST framing

This project is not implementing a full formal compliance program yet, but the design should still be legible through a minimal risk-management lens.

For this document, use two reference frames:

  1. NIST AI RMF 1.0

    • Govern
    • Map
    • Measure
    • Manage
  2. NIST RMF / NIST SP 800-53 style controls

    • used here as a lightweight control backlog and architecture checklist

This is not a formal accreditation package. It is a practical mapping so the project can evolve into a more defensible financial-model operating pattern later.

Minimal NIST AI RMF mapping for this project

Govern

What it means here:

  • define who can train, validate, approve, and promote models
  • define what evidence is required before promotion
  • define what logs and audit records must exist

Current alignment:

  • candidate / validation / approval / promotion states now exist
  • promotion events are recorded

Backlog:

  • formal role separation
  • policy ownership
  • exception process
  • model-risk signoff

Map

What it means here:

  • identify the model purpose
  • identify the data sources
  • identify the risks of bad outputs
  • identify the environments where the model is allowed to operate

Current alignment:

  • model scope is explicit
  • source data is known
  • dev/test/prod logical stages are defined

Backlog:

  • explicit model risk classification
  • documented misuse / abuse scenarios
  • downstream business impact analysis

Measure

What it means here:

  • evaluate model quality
  • evaluate operational correctness
  • evaluate security and abuse exposure

Current alignment:

  • training metrics are recorded
  • validation status exists
  • artifact and promotion paths are explicit

Backlog:

  • formal validation thresholds
  • reproducibility checks
  • adversarial and red-team testing
  • drift and anomaly measurement

Manage

What it means here:

  • control promotion
  • control rollback
  • monitor active versions
  • respond to failures or misuse

Current alignment:

  • production can be pinned to a deployed model version
  • promotion records exist

Backlog:

  • rollback runbook enforcement
  • alerting on model-version drift
  • incident response integration

Minimal NIST RMF lifecycle mapping

This project can be read through a simplified RMF lifecycle (the Prepare step that NIST SP 800-37 Rev. 2 places first is omitted here):

  1. Categorize
    • classify the model and data as financially sensitive operational assets
  2. Select
    • choose minimal controls appropriate for a pilot with future production intent
  3. Implement
    • add Key Vault, managed identity, durable artifacts, approval states, and deployment mapping
  4. Assess
    • validate pipeline behavior, security assumptions, and promotion controls
  5. Authorize
    • explicitly approve model movement to test and prod
  6. Monitor
    • add logging, drift monitoring, security monitoring, and periodic access review

This project is currently between Implement and early Assess.

Current State

Already implemented

  • Azure ML workspace, compute, and environment
  • Key Vault-backed secret resolution
  • Blob-backed durable model artifacts
  • staged AML jobs:
    • extract
    • features
    • train
    • score
    • publish
  • PostgreSQL-backed training and inference metadata
  • standalone score path that can load a fixed registered training run
  • basic secret hygiene through Key Vault + managed identity

Missing before this design pass

  • explicit candidate model registration
  • explicit validation status
  • explicit approval status
  • explicit deployment environment mapping
  • a clean way for production scoring to use a fixed promoted model version

Design Principles

  1. Keep the Azure bill low.
  2. Keep the operator workflow simple.
  3. Separate training from deployment.
  4. Promote metadata first, not infrastructure first.
  5. Use the same codebase and the same core Azure resources for now.
  6. Delay heavy enterprise features until the promotion path is stable.
  7. Keep security requirements attached to each lifecycle stage so they are not forgotten.
  8. Prefer identity-based access over shared secrets wherever possible.
  9. Default to least privilege, immutability, and auditability.

Least-Cost Azure Resource Strategy

For now, do not create separate Azure stacks for dev, test, and prod.

Use one low-cost shared platform and separate environments logically through metadata and process:

  • AML Workspace: one shared workspace
  • AML Compute: one small CPU cluster with autoscale-to-zero
  • Blob Storage: one shared pilot storage account with model/run prefixes
  • PostgreSQL Flexible Server: one small server for metadata and batch results
  • Key Vault: the workspace Key Vault

This is the least-cost approach because it avoids:

  • duplicate workspaces
  • duplicate compute clusters
  • duplicate Postgres servers
  • duplicate storage accounts

The tradeoff is that dev/test/prod are logical stages, not hard infrastructure isolation. That is acceptable for this stage of the project.

Security backlog for the low-cost shared-platform choice

Required later:

  • separate Azure subscriptions or management groups for higher-trust environments
  • environment-specific storage accounts and databases
  • private networking for AML, Blob, Postgres, and Key Vault
  • CMK-backed encryption where required by policy
  • stricter RBAC separation between developers, approvers, and operators

Accepted for now:

  • one shared Azure platform with logical environment separation
  • public network access where already required for speed of setup

This is acceptable for the current stage, but not the final target state for a financial system.

Minimal control references to use in this design

The following NIST-style control families are the most relevant minimal set for this project.

These are not all implemented today. They are the reference set for backlog planning.

Access control

  • AC-2 Account Management
  • AC-3 Access Enforcement
  • AC-6 Least Privilege
  • AC-17 Remote Access

Use here for:

  • AML workspace role separation
  • Key Vault access restriction
  • promotion-role separation

Audit and accountability

  • AU-2 Event Logging
  • AU-3 Content of Audit Records
  • AU-6 Audit Record Review, Analysis, and Reporting
  • AU-12 Audit Record Generation

Use here for:

  • promotion events
  • approval evidence
  • scoring and publication trail

Configuration and change management

  • CM-2 Baseline Configuration
  • CM-3 Configuration Change Control
  • CM-5 Access Restrictions for Change
  • CM-8 System Component Inventory

Use here for:

  • AML environment versioning
  • job YAML control
  • environment promotion discipline

Identification and authentication

  • IA-2 Identification and Authentication (Organizational Users)
  • IA-5 Authenticator Management

Use here for:

  • operator identities
  • secret rotation
  • managed identity usage

System and communications protection

  • SC-7 Boundary Protection
  • SC-8 Transmission Confidentiality and Integrity
  • SC-12 Cryptographic Key Establishment and Management
  • SC-13 Cryptographic Protection
  • SC-28 Protection of Information at Rest

Use here for:

  • private endpoints
  • TLS enforcement
  • encryption at rest
  • Key Vault-backed secret storage

System and information integrity

  • SI-3 Malicious Code Protection
  • SI-4 System Monitoring
  • SI-7 Software, Firmware, and Information Integrity
  • SI-10 Information Input Validation

Use here for:

  • artifact integrity
  • validation of job inputs and model metadata
  • monitoring for unauthorized drift

Risk assessment

  • RA-3 Risk Assessment
  • RA-5 Vulnerability Monitoring and Scanning

Use here for:

  • threat modeling
  • dependency scanning
  • periodic review of AML/storage/Postgres exposure

Supply chain / provenance

  • SR-3 Supply Chain Controls and Processes
  • SR-11 Component Authenticity

Use here for:

  • image provenance
  • package provenance
  • signed artifact direction

Contingency / resilience

  • CP-2 Contingency Plan
  • CP-9 System Backup
  • CP-10 System Recovery and Reconstitution

Use here for:

  • rollback plan
  • recovery from bad model promotion
  • durable artifact recovery

Minimal Promotion Architecture

Training system of record

ndx_az_elasticnet.training_runs

This stores:

  • model name
  • feature set
  • horizon
  • training metrics
  • parameters
  • durable model_uri

Security expectations for this layer:

  • all training records should be append-only from an operator perspective
  • mutable status changes should be auditable
  • no secrets should ever be stored in metrics or parameter JSON
  • model URIs should point only to approved internal storage locations
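
The last expectation can be checked before a URI is ever written to training_runs. The following is a minimal sketch, not the project's implementation: the storage host and container in APPROVED_LOCATIONS are hypothetical placeholders, and the function name is illustrative.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of (host, container) pairs approved for model artifacts.
APPROVED_LOCATIONS = {
    ("examplepilotstorage.blob.core.windows.net", "models"),
}


def is_approved_model_uri(model_uri: str) -> bool:
    """Return True only for https URIs inside an approved host and container."""
    parsed = urlparse(model_uri)
    if parsed.scheme != "https" or parsed.hostname is None:
        return False
    # The first path segment of a Blob URL is the container name.
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2 or ".." in segments:
        return False
    return (parsed.hostname.lower(), segments[0]) in APPROVED_LOCATIONS
```

Matching on the parsed hostname rather than on a string prefix means a URI such as `https://examplepilotstorage.blob.core.windows.net@evil.example.com/...` is rejected, because its real host is the part after the `@`.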

Model registry

ndx_az_elasticnet.model_versions

This is the promotion-facing registry layer on top of training_runs.

Each row represents one registered candidate model version tied to exactly one successful training run.

It stores:

  • model_version_id
  • training_run_id
  • model_uri
  • metrics and parameters
  • source commit
  • training data window
  • candidate_status
  • validation_status
  • approval_status

Security expectations for this layer:

  • promotion status changes must be attributable to a principal or workflow
  • approval should eventually require a distinct role from the trainer role
  • registry records should be tamper-evident through audit logs
  • model metadata should include source commit and training-data window for traceability

Deployment map

ndx_az_elasticnet.model_deployments

This maps each logical environment to exactly one active model version:

  • dev
  • test
  • prod

This lets production scoring use a fixed approved version instead of “latest successful training run”.

Security expectations for this layer:

  • production deployment mapping should only be changeable by a restricted release role
  • environment aliases should not silently advance to a new model without an explicit promotion event
  • prod scoring should refuse unapproved models
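
As an illustration of how scoring might resolve a pinned version, here is a minimal in-memory sketch. It assumes the deployment map and registry can be read as plain dictionaries, and that an approved model carries approval_status = "approved"; the text above only names the "pending" value, so that string is an assumption, as are the function and exception names.

```python
class UnapprovedModelError(RuntimeError):
    """Raised when prod would resolve to a model that is not approved."""


def resolve_model_uri(environment: str, deployments: dict, versions: dict) -> str:
    """Return the model_uri pinned to an environment, never 'latest'."""
    if environment not in ("dev", "test", "prod"):
        raise ValueError(f"unknown environment: {environment!r}")
    version_id = deployments.get(environment)
    if version_id is None:
        raise LookupError(f"no model is deployed to {environment}")
    version = versions[version_id]
    # Assumed status value: 'approved' is not confirmed by the schema description.
    if environment == "prod" and version["approval_status"] != "approved":
        raise UnapprovedModelError(f"{version_id} is not approved for prod")
    return version["model_uri"]
```

Because the lookup fails when no mapping exists instead of falling back to the newest training run, an environment cannot silently advance to a new model.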

Promotion audit trail

ndx_az_elasticnet.model_promotion_events

This records:

  • candidate registration
  • validation decisions
  • approval decisions
  • environment promotions

Security expectations for this layer:

  • every state transition must be recorded
  • operator notes should be mandatory in stricter environments
  • audit records should be exported to a longer-retention sink later
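
One possible way to make such a trail tamper-evident is to chain each event to the hash of the previous one. This is a sketch of that general technique, not a description of how model_promotion_events is implemented; the field names (actor, note, prev_hash, event_hash) are assumptions made for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64


def _hash_body(body: dict) -> str:
    # sort_keys gives a stable serialization, so the same event always hashes the same.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log: list, event_type: str, model_version_id: str,
                 actor: str, note: str = "") -> dict:
    """Append one state transition, linked to the previous event's hash."""
    body = {
        "event_type": event_type,
        "model_version_id": model_version_id,
        "actor": actor,
        "note": note,
        "prev_hash": log[-1]["event_hash"] if log else GENESIS_HASH,
    }
    event = {**body, "event_hash": _hash_body(body)}
    log.append(event)
    return event


def verify_chain(log: list) -> bool:
    """Return False if any event was edited, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for event in log:
        body = {k: v for k, v in event.items() if k != "event_hash"}
        if body["prev_hash"] != prev_hash or _hash_body(body) != event["event_hash"]:
            return False
        prev_hash = event["event_hash"]
    return True
```

Editing any earlier event changes its hash, which breaks every later link, so a periodic `verify_chain` run would surface after-the-fact changes.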

Lifecycle

1. Develop in dev

  • feature code changes
  • training code changes
  • scoring code changes
  • AML packaging changes

This is standard code development.

Security backlog for development:

  • mandatory branch protection
  • signed commits or equivalent commit provenance
  • secret scanning in PRs
  • dependency scanning and CVE policy
  • SAST for Python and shell
  • IaC scanning for AML, storage, Key Vault, and Postgres configuration
  • coding standards for secure SQL, secure serialization, and input handling
  • red-team-inspired abuse cases for:
    • secret leakage
    • artifact tampering
    • promotion bypass
    • poisoned training data assumptions

Relevant references:

  • CM-3, CM-5
  • RA-5
  • SR-11

2. Merge after CI passes

CI should eventually include:

  • unit tests
  • linting
  • packaging validation
  • shell script validation

The current repo already supports the test and packaging checks needed to start.

Security backlog for CI:

  • SCA/dependency scanning
  • SBOM generation
  • container image vulnerability scanning
  • reproducible build metadata
  • policy checks that block use of plaintext secrets
  • test that production scoring only accepts promoted models
  • checks that AML job definitions do not take DB secrets as direct inputs
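
The last two policy checks could start as a simple text scan over job definitions. The sketch below assumes secrets leak as PostgreSQL connection URLs with an inline password; the pattern and function name are illustrative, and a real CI gate would more likely combine this with a dedicated secret scanner.

```python
import re

# Matches connection URLs that embed a password, e.g. postgresql://user:secret@host/db
_DB_URL_WITH_PASSWORD = re.compile(
    r"postgres(?:ql)?://[^\s:/@]+:[^\s@]+@", re.IGNORECASE
)


def find_plaintext_db_urls(text: str) -> list:
    """Return 1-based line numbers that contain a DB URL with an inline password."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if _DB_URL_WITH_PASSWORD.search(line)
    ]
```

A CI step could run this over each job YAML file and fail the build when the returned list is not empty.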

Relevant references:

  • AU-2, AU-12
  • CM-2
  • RA-5
  • SR-3

3. Train candidate model

Run the staged AML batch or the relevant train flow:

  • extract
  • features
  • train

Training writes:

  • durable model pickle to Blob
  • durable metrics JSON to Blob
  • training metadata to training_runs

Security backlog for training and data handling:

  • source data integrity validation
  • schema drift detection before feature generation
  • training-data lineage capture with hashes or dataset version IDs
  • access control that restricts who can trigger training in higher environments
  • outbound network restriction for AML jobs where practical
  • malware / artifact scanning for persisted model files
  • model serialization review to reduce unsafe pickle risk over time
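
Lineage hashes and artifact integrity checks both reduce to recording a digest at write time and comparing it at load time. The following sketch assumes SHA-256 and hypothetical function names; where the digest would be stored (for example alongside model_uri) is left open.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Iterable, Iterator


def read_chunks(path: Path, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's bytes in chunks so large artifacts are not loaded at once."""
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hex SHA-256 digest of the concatenated chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def artifact_matches(chunks: Iterable[bytes], expected_sha256: str) -> bool:
    """Compare the computed digest with the one recorded at training time."""
    return hmac.compare_digest(sha256_of_chunks(chunks), expected_sha256.lower())
```

For a downloaded model file, `artifact_matches(read_chunks(path), recorded_digest)` should hold before the pickle is ever deserialized, since unpickling untrusted bytes can execute arbitrary code.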

Financial-data controls required later:

  • confirm retention policy for copied market data
  • document permitted source datasets
  • document whether any licensed data has redistribution restrictions
  • ensure training artifacts never embed raw credentials or private operational data

Relevant references:

  • SC-8, SC-28
  • SI-7, SI-10
  • RA-3

4. Register candidate model

Use:

  • scripts/register_candidate_model.py

This creates a row in model_versions from a successful training run.

The candidate starts as:

  • candidate_status = candidate
  • validation_status = pending
  • approval_status = pending
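
To make the initial state concrete, here is an in-memory sketch of what registration produces. It is not the contents of register_candidate_model.py: the ModelVersion class is a stand-in for a model_versions row, and the training-run keys, including a "status" field with the value "succeeded", are assumptions.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ModelVersion:
    """Stand-in for one model_versions row (column types are assumed)."""
    training_run_id: str
    model_uri: str
    source_commit: str
    model_version_id: str = field(default_factory=lambda: str(uuid4()))
    candidate_status: str = "candidate"
    validation_status: str = "pending"
    approval_status: str = "pending"


def register_candidate(training_run: dict) -> ModelVersion:
    """Create a candidate from a training run, refusing unsuccessful runs."""
    # 'status' and 'succeeded' are assumed names for the run outcome.
    if training_run.get("status") != "succeeded":
        raise ValueError("only successful training runs can be registered")
    return ModelVersion(
        training_run_id=training_run["training_run_id"],
        model_uri=training_run["model_uri"],
        source_commit=training_run["source_commit"],
    )
```

Every new candidate starts with both gates closed, so nothing reaches test or prod without later, explicit status changes.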

Security backlog for candidate registration:

  • only successful training runs should be registerable
  • candidate registration should capture the acting principal
  • candidate registration should eventually require policy checks on:
    • metrics completeness
    • source commit presence
    • model URI location

Relevant references:

  • AU-3, AU-12
  • CM-3

5. Run validation pipeline

For now, validation is a controlled operator step backed by metadata.

Use:

  • scripts/validate_candidate_model.py

This marks the model version:

  • passed
  • or failed

Later this can be replaced by a real validation job that checks:

  • metric thresholds
  • schema compatibility
  • reproducibility
  • inference smoke tests
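
The metric-threshold check is the easiest of these to automate first. A minimal sketch follows, assuming thresholds are expressed as a minimum or maximum per metric; the metric names and limits in the usage note are invented for illustration and are not the project's validation criteria.

```python
def evaluate_metrics(metrics: dict, thresholds: dict) -> tuple:
    """Return ('passed' or 'failed', reasons) for a candidate's metrics.

    thresholds maps a metric name to (direction, limit), where direction is
    'min' (value must be >= limit) or 'max' (value must be <= limit).
    """
    reasons = []
    for name, (direction, limit) in thresholds.items():
        if direction not in ("min", "max"):
            raise ValueError(f"unknown threshold direction: {direction!r}")
        value = metrics.get(name)
        if value is None:
            # A missing metric fails closed instead of being skipped.
            reasons.append(f"{name}: metric missing")
        elif direction == "min" and value < limit:
            reasons.append(f"{name}: {value} is below minimum {limit}")
        elif direction == "max" and value > limit:
            reasons.append(f"{name}: {value} is above maximum {limit}")
    return ("failed" if reasons else "passed", reasons)
```

For example, with thresholds `{"r2": ("min", 0.10), "rmse": ("max", 0.05)}`, a candidate reporting only `{"r2": 0.05}` fails with two reasons: one for the low score and one for the missing metric. The reasons list is a natural fit for the operator note on the validation event.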

Security backlog for validation:

  • enforce validation thresholds by policy rather than operator judgment alone
  • add adversarial / abuse-oriented validation:
    • malformed artifact path
    • missing or tampered blob artifact
    • stale or mismatched model metadata
    • unexpected schema changes
  • add red-team scenarios:
    • can a non-approved model be injected into test or prod
    • can a manipulated model URI point outside approved storage
    • can recommendation publishing be triggered from an unvalidated run

Relevant references:

  • SI-7
  • RA-3
  • AU-6

6. Promote to test

Use:

  • scripts/promote_model.py --target-environment test

Only validated candidates should move to test.

Test is where you perform:

  • integration checks
  • shadow/batch checks
  • production-like scoring verification

Security backlog for test promotion:

  • require a release principal distinct from the training principal
  • verify Key Vault access still follows least privilege
  • verify environment-scoped scoring uses the deployment mapping, not the latest run shortcut
  • verify logs and audit trail are complete before allowing prod promotion

Relevant references:

  • AC-3, AC-6
  • AU-2, AU-6
  • CM-5

7. Approve model

Use:

  • scripts/approve_model.py

This is the explicit gate between:

  • technically validated
  • and organizationally approved

Security backlog for approval:

  • implement maker-checker separation
  • require human approval in prod path
  • integrate approval evidence with ticket/change-management reference
  • require attestation that validation artifacts were reviewed
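
Maker-checker separation can be expressed as a small guard in the approval step. The sketch below is hypothetical: the trained_by, approved_by, and change_ticket fields and the "approved" status value are assumptions, not columns confirmed by the schema description.

```python
def approve_model(version: dict, approver: str, change_ticket: str) -> dict:
    """Return an approved copy of the version, enforcing maker-checker rules."""
    if version["validation_status"] != "passed":
        raise ValueError("only validated models can be approved")
    # Maker-checker: the person who trained the model cannot approve it.
    if approver == version["trained_by"]:
        raise PermissionError("approver must differ from the trainer")
    if not change_ticket.strip():
        raise ValueError("a change-management reference is required")
    return {
        **version,
        "approval_status": "approved",
        "approved_by": approver,
        "change_ticket": change_ticket,
    }
```

Returning a new record instead of mutating the input keeps the previous state available for the audit event that records the transition.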

Relevant references:

  • AC-2, AC-6
  • AU-3
  • CM-3

8. Promote to prod

Use:

  • scripts/promote_model.py --target-environment prod

Only approved models should reach prod.
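
The two gate rules (validated before test, approved before prod) can be sketched as one check that a promotion script might run first. This is an assumption-labeled illustration, not the logic of promote_model.py; in particular, the "approved" status value and the choice to also require a passed validation for prod are assumptions.

```python
def promotion_blockers(version: dict, target_environment: str) -> list:
    """Return the reasons a promotion must be refused (empty list means allowed)."""
    if target_environment not in ("dev", "test", "prod"):
        raise ValueError(f"unknown environment: {target_environment!r}")
    blockers = []
    if target_environment in ("test", "prod") and version["validation_status"] != "passed":
        blockers.append("model is not validated")
    if target_environment == "prod" and version["approval_status"] != "approved":
        blockers.append("model is not approved")
    return blockers
```

Returning reasons instead of a bare boolean lets the caller write them into the promotion event when a request is refused.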

Security backlog for prod promotion:

  • require stronger RBAC and ideally PIM/JIT elevation
  • require immutable deployment event records
  • require rollback readiness before cutover
  • require release checklist covering:
    • model version
    • validation evidence
    • approval evidence
    • rollback target
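
The release checklist lends itself to a mechanical completeness check before cutover. A minimal sketch, assuming the checklist is captured as a dictionary whose keys mirror the four items above (the key names are hypothetical):

```python
REQUIRED_CHECKLIST_FIELDS = (
    "model_version_id",
    "validation_evidence",
    "approval_evidence",
    "rollback_target",
)


def missing_checklist_items(checklist: dict) -> list:
    """Return required checklist fields that are absent or blank."""
    return [
        name
        for name in REQUIRED_CHECKLIST_FIELDS
        if not str(checklist.get(name) or "").strip()
    ]
```

Requiring a non-empty rollback_target here is what turns "rollback readiness" from an intention into a precondition: a prod promotion with no named fallback version is simply refused.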

Relevant references:

  • CP-2, CP-10
  • CM-3
  • AU-12

9. Production scoring uses promoted version

Production scoring should resolve the model by deployment environment:

  • --deployment-environment prod

That keeps production inference fixed to the approved version.

Security backlog for production scoring:

  • run with least-privilege identity
  • prevent write access to registry tables from the scoring identity
  • deny the production scoring identity access to training-related secrets
  • produce immutable inference logs and recommendation publication logs
  • monitor for unexpected model-version drift
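
Version-drift monitoring can begin as a comparison between the version pinned in the deployment map and the version recorded on each inference row. The sketch below assumes inference metadata carries a model_version_id field; that field name and the function name are assumptions.

```python
def find_version_drift(expected_version_id: str, inference_records: list) -> list:
    """Return inference records scored with a version other than the pinned one."""
    return [
        record
        for record in inference_records
        if record.get("model_version_id") != expected_version_id
    ]
```

Records with no version recorded at all are also returned, since an unattributed score is as much a finding as a mismatched one.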

Relevant references:

  • AC-6
  • SC-7, SC-8
  • SI-4, SI-7

Scripts Added

Candidate registration

  • mmaindx_az_elasticnet/scripts/register_candidate_model.py

Validation decision

  • mmaindx_az_elasticnet/scripts/validate_candidate_model.py

Approval decision

  • mmaindx_az_elasticnet/scripts/approve_model.py

Promotion

  • mmaindx_az_elasticnet/scripts/promote_model.py

Operator Commands

Register latest successful training run as candidate

make register-candidate

Validate a candidate

make validate-candidate MODEL_VERSION_ID=<uuid>

Approve a validated candidate

make approve-model MODEL_VERSION_ID=<uuid>

Promote to test

make promote-model MODEL_VERSION_ID=<uuid> DEPLOYMENT_ENV=test

Promote to prod

make promote-model MODEL_VERSION_ID=<uuid> DEPLOYMENT_ENV=prod

Score by promoted environment

uv run python mmaindx_az_elasticnet/scripts/score_latest.py \
  --as-of-date 2026-06-05 \
  --deployment-environment prod

Security Backlog by Domain

Identity and access

  • separate roles for developer, trainer, validator, approver, and prod operator
  • prefer managed identities over shared credentials everywhere
  • move toward JIT/PIM for prod-changing actions

Secrets management

  • eliminate remaining direct DB URLs from local operator flows where practical
  • rotate source and target DB secrets regularly
  • add secret expiry/rotation policy and ownership

Network security

  • move AML, Blob, Postgres, and Key Vault to private endpoints
  • restrict public ingress and egress
  • add firewall allowlists and DNS design

Supply chain security

  • signed images
  • SBOMs
  • image provenance
  • dependency pinning with vulnerability policy

Data security

  • data classification for market, operational, and derived model data
  • retention and deletion policy
  • encryption at rest and in transit review
  • artifact integrity verification

Logging and monitoring

  • centralized audit logging
  • promotion-event monitoring
  • failed-auth and privilege-escalation monitoring
  • anomaly detection on unexpected production model changes

Red teaming and resilience

Backlog red-team exercises should include:

  • stealing or replaying model artifacts
  • tampering with deployment mappings
  • bypassing approval to reach prod
  • poisoning copied training data
  • abusing AML job definitions to exfiltrate secrets
  • publishing recommendations from an unapproved model

Compliance and governance

  • change-management linkage for prod promotion
  • evidence retention for approvals and validation
  • documented operating procedures for rollback
  • periodic access review

Security priorities by phase

Acceptable for current pilot

  • shared Azure platform with logical environment separation
  • managed identity + Key Vault for core secrets
  • durable model artifacts
  • audit trail for promotion metadata

Required next

  • branch protection and CI scanning
  • trainer / validator / approver role separation
  • formal validation criteria
  • production scoring only through promoted environment mapping
  • secret rotation plan

Primary references:

  • AC-6
  • AU-12
  • CM-3
  • RA-5

Required before real production use

  • private networking for AML / Blob / Postgres / Key Vault
  • stronger RBAC and JIT/PIM for promotion roles
  • centralized audit logging and alerting
  • immutable release evidence and rollback procedure
  • supply-chain controls for images and dependencies
  • formal red-team / adversarial testing
  • periodic access review and governance workflow

Primary references:

  • AC-3
  • AC-6
  • AU-2
  • AU-6
  • SC-7
  • SC-28
  • SI-4
  • SI-7
  • RA-3
  • RA-5
  • SR-3
  • CP-10

Mapping: What Exists vs What Is Missing

Exists now

  • durable model storage
  • training metadata
  • inference metadata
  • candidate registration
  • validation status
  • approval status
  • environment deployment mapping
  • environment-based scoring
  • basic security foundation through Key Vault + managed identity

Still missing

  • real automated validation policy engine
  • AML-native pipeline object
  • separate infrastructure per environment
  • formal model registry service
  • production monitoring and alerting
  • policy-as-code for approval controls
  • maker-checker approval enforcement
  • private networking and environment isolation
  • centralized security logging and alerting
  • red-team validation and adversarial testing
  • supply-chain security controls

Do not jump to a heavy enterprise platform yet.

The next sensible steps are:

  1. Run one full staged AML batch with the new promotion model.
  2. Register the resulting training run as a candidate.
  3. Validate and promote it to test.
  4. Run score in test.
  5. Approve and promote to prod.
  6. Make production scoring use --deployment-environment prod.

After that, add:

  • AML pipeline-job orchestration
  • validation job automation
  • environment-specific configuration overlays
  • stronger audit/security controls
  • threat modeling workshop
  • red-team backlog execution
  • private network migration
  • least-privilege role split for promotion operations

Why this is the right level of architecture now

This design gives you:

  • versioned trained models
  • explicit promotion states
  • explicit deployment targets
  • low Azure cost
  • clear operator workflow

without forcing:

  • multiple workspaces
  • multiple databases
  • a full external registry service
  • a large MLOps platform upfront

This is the correct minimum viable production pattern for the current project stage.