Terraform on Azure: Structuring Multi-Environment IaC for Enterprise Teams

The directory layout, state strategy, and pipeline patterns that separate enterprise-grade Terraform from repos that collapse under their own complexity.

Tags: Terraform, Azure, IaC, Multi-Environment, Enterprise, DevOps, GitOps

Most teams start Terraform the same way: a single directory, a single state file, and three variable files for dev, staging, and prod. It works for a few months. Then someone runs `terraform apply -var-file=prod.tfvars` from the wrong directory, and the incident retrospective introduces a policy that makes everything slower without solving the underlying structural problem.

The structural problem is this: multi-environment IaC is not a naming convention problem. It is a blast-radius problem. The goal is to make it impossible to accidentally touch production when you mean to touch staging. That requires isolation at the state level, enforced by directory boundaries and pipeline gates, not by developer discipline.

The anti-patterns enterprise teams reach for first

Two patterns appear reliably as teams scale beyond a handful of resources. The first is Terraform workspaces. The appeal is obvious: one directory, multiple states, switch with a single command. The problem is that workspaces were designed for ephemeral copies of the same configuration, not long-lived environment tiers. They share a backend configuration and, critically, a single copy of the code, with no clean way to express structural differences between environments. A production environment that runs three replicas of a service, with private endpoints and no public access, is architecturally different from dev, not just numerically different. Workspaces do not model that cleanly.
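
The failure mode shows up in the code itself: to make one directory serve every tier, environment logic becomes a pile of conditionals on `terraform.workspace`. A sketch of where that ends up (resource names and values are illustrative):

# Structural differences forced into ternaries, one per divergence
resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-${terraform.workspace}"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "aks-${terraform.workspace}"

  # prod is private; every other workspace is public
  private_cluster_enabled = terraform.workspace == "prod"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D4s_v5"
    node_count = terraform.workspace == "prod" ? 3 : 1
  }

  identity {
    type = "SystemAssigned"
  }
}

Every architectural difference becomes another ternary, and the prod-only branches are never exercised until a prod apply.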

The second anti-pattern is environment branches: a `dev` branch, a `staging` branch, and a `main` branch that each hold slightly diverged infrastructure code. This looks like GitOps until the branches drift. A hotfix applied to `main` that never gets cherry-picked to `dev` means your dev environment does not reflect production architecture. Three months later the team is debugging an issue that cannot be reproduced in dev because the branches have silently diverged in ways nobody fully tracks.

The test of a good multi-environment layout: can a new team member look at the repository and immediately understand which directory corresponds to which environment, which state file owns which resources, and what pipeline runs on which trigger? If the answer requires reading a README, the structure is not good enough.

Directory-based isolation: the pattern that scales

The structure that holds up under enterprise conditions separates configuration from modules and makes environment boundaries explicit at the filesystem level. Each environment is a top-level directory with its own backend configuration and its own Terraform root. Shared infrastructure logic lives in versioned modules. The environment root calls the module; the module does not know which environment it is running in.

infra/
├── modules/
│   ├── networking/       # hub-spoke, NSGs, DNS
│   ├── compute/          # AKS, VM Scale Sets
│   ├── storage/          # Storage Accounts, Key Vault
│   └── monitoring/       # Log Analytics, alerts
│
├── environments/
│   ├── dev/
│   │   ├── backend.tf    # dev state container config
│   │   ├── main.tf       # module calls
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── backend.tf
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── backend.tf
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
│
└── shared/
    ├── dns-zones/        # cross-env DNS resources
    └── management/       # Log Analytics workspace, Sentinel

Each environment root is a complete Terraform configuration. `terraform init` in `environments/prod` initialises against the production backend. Nothing in the `prod` directory can accidentally reference staging state unless you explicitly wire a remote state data source — which, if you do it, is intentional and visible in code review.
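
When a cross-environment read is genuinely needed, the wiring is explicit. A minimal sketch of the remote state data source mentioned above, assuming the shared DNS zones keep their own state (names are illustrative):

data "terraform_remote_state" "shared_dns" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-tfstate-shared"
    storage_account_name = "satfstateshared001"
    container_name       = "tfstate"
    key                  = "infra/shared/dns-zones/terraform.tfstate"
  }
}

# Consumed explicitly, so the dependency survives code review:
# data.terraform_remote_state.shared_dns.outputs.zone_id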

Remote state on Azure Blob Storage

Azure Blob Storage is the natural backend for Azure-hosted infrastructure: the azurerm backend locks state with blob leases, so concurrent runs cannot corrupt it. One storage account per environment, with blob versioning enabled for state history, gives you independent state files that cannot be confused with each other. The storage account itself is typically provisioned manually or via a bootstrap script — it cannot be managed by the Terraform configuration it stores state for.
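
A minimal bootstrap sketch with the Azure CLI, using the same names as the backend block below (flags assume a current az version):

az group create --name rg-tfstate-prod --location westeurope

az storage account create \
  --name satfstateprod001 \
  --resource-group rg-tfstate-prod \
  --sku Standard_GRS \
  --kind StorageV2 \
  --min-tls-version TLS1_2 \
  --allow-blob-public-access false

az storage account blob-service-properties update \
  --account-name satfstateprod001 \
  --resource-group rg-tfstate-prod \
  --enable-versioning true

az storage container create \
  --name tfstate \
  --account-name satfstateprod001 \
  --auth-mode login

With that in place, each environment root points at its own account: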

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate-prod"
    storage_account_name = "satfstateprod001"
    container_name       = "tfstate"
    key                  = "infra/prod/terraform.tfstate"
    use_oidc             = true
    subscription_id      = "00000000-0000-0000-0000-000000000000"
  }
}

Use `use_oidc = true` instead of storing service principal credentials in pipeline variables. Azure DevOps and GitHub Actions both support OIDC federation with Entra ID — no secrets stored, no rotation required. Your pipeline identity is a managed identity or federated credential that works only from your pipeline, not from a developer's laptop.
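
The federation itself is a handful of resources. A sketch with the azuread provider for a GitHub Actions pipeline, assuming a recent provider version and an application resource named `pipeline` (the subject format is fixed by GitHub; the rest is illustrative):

resource "azuread_application_federated_identity_credential" "prod_pipeline" {
  application_id = azuread_application.pipeline.id
  display_name   = "github-prod-environment"
  description    = "OIDC federation for the prod apply pipeline"
  audiences      = ["api://AzureADTokenExchange"]
  issuer         = "https://token.actions.githubusercontent.com"
  subject        = "repo:your-org/infra:environment:prod"
}

Tokens are only exchanged when the workflow runs against the prod environment in that repository; there is nothing to leak and nothing to rotate.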

Naming the state key with the full environment path (`infra/prod/terraform.tfstate`) makes accidental collision nearly impossible even if teams share a storage account for cost reasons. In practice, separate storage accounts per environment are worth the marginal cost — they allow you to assign different RBAC permissions so a dev pipeline identity cannot read prod state.
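
That RBAC split is one role assignment per pipeline identity, scoped to its own environment's storage account. A sketch (identity and account names are illustrative):

# The dev pipeline identity can read and write dev state only;
# no equivalent assignment exists on the prod storage account.
resource "azurerm_role_assignment" "dev_pipeline_state" {
  scope                = azurerm_storage_account.tfstate_dev.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_user_assigned_identity.pipeline_dev.principal_id
}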

Module versioning and the promotion model

Modules are shared code. The question of how environments reference modules determines whether your infrastructure changes are safe to promote. The two practical approaches are Git tags and a private Terraform registry.

module "networking" {
  source  = "git::https://github.com/your-org/tf-modules.git//modules/networking?ref=v1.4.2"

  resource_group_name = var.resource_group_name
  location            = var.location
  address_space       = var.vnet_address_space
  subnet_config       = var.subnet_config

  tags = local.common_tags
}

module "compute" {
  source  = "git::https://github.com/your-org/tf-modules.git//modules/compute?ref=v2.1.0"

  resource_group_name    = var.resource_group_name
  location               = var.location
  kubernetes_version     = var.kubernetes_version
  node_pool_config       = var.node_pool_config
  subnet_id              = module.networking.aks_subnet_id

  tags = local.common_tags
}

Dev references a module tag. Staging references the same or a newer tag. Prod references a tag that has been validated in staging. The promotion path is explicit: upgrade the module tag in dev, open a PR, merge, apply, validate. Then upgrade in staging, repeat. Then prod. If something breaks in staging, prod is never exposed to the change.

  1. Module change tagged v1.5.0
  2. Dev references v1.5.0 → apply
  3. Dev validation passes
  4. Staging upgraded to v1.5.0 → apply
  5. Staging validation passes
  6. Prod upgraded to v1.5.0 → apply
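
In file terms each promotion is a one-line diff per module call. At step 4, for example, only the pinned ref in the staging root changes (sketch):

# environments/staging/main.tf (step 4: staging picks up the new tag)
module "networking" {
  source = "git::https://github.com/your-org/tf-modules.git//modules/networking?ref=v1.5.0"
  # ...
}

# environments/prod/main.tf (unchanged until step 6)
module "networking" {
  source = "git::https://github.com/your-org/tf-modules.git//modules/networking?ref=v1.4.2"
  # ...
}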

Variable strategy: what changes per environment

Variables in a multi-environment repo serve two purposes: expressing what is structurally different between environments, and capturing environment-specific values like subscription IDs or SKUs. Conflating the two creates variable files that are impossible to audit.

A clean separation: `variables.tf` declares the interface (what the environment root accepts). `terraform.tfvars` provides the values. Secrets — subscription IDs, service principal object IDs, anything you would not want in a public repository — are passed as pipeline environment variables or read from Azure Key Vault at apply time using a data source.

# Non-sensitive environment values — safe to commit
location             = "westeurope"
environment          = "prod"
kubernetes_version   = "1.29"
node_pool_vm_sku     = "Standard_D4s_v5"
node_pool_min_count  = 3
node_pool_max_count  = 10
vnet_address_space   = ["10.0.0.0/16"]

# Secrets are NOT here — they come from pipeline vars or Key Vault
# subscription_id   → TF_VAR_subscription_id in pipeline
# tenant_id         → TF_VAR_tenant_id in pipeline
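
The sensitive side is read at plan time with data sources, so values never land in the repo. A minimal sketch, assuming a vault named `kv-infra-prod` (all names illustrative):

data "azurerm_key_vault" "infra" {
  name                = "kv-infra-prod"
  resource_group_name = "rg-security-prod"
}

data "azurerm_key_vault_secret" "db_admin_password" {
  name         = "db-admin-password"
  key_vault_id = data.azurerm_key_vault.infra.id
}

# Referenced as data.azurerm_key_vault_secret.db_admin_password.value.
# Note the value still ends up in state, which is one more reason the
# state storage accounts are locked down per environment.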

Keep `terraform.tfvars` free of secrets. Reviewers need to see environment configuration in code review; they cannot review what is hidden in a pipeline variable store. A useful test: if every line of your tfvars file could appear in a public repository without concern, the variable split is correct.

CI/CD: pipeline triggers and the approval gate

The pipeline structure mirrors the directory structure. Each environment has its own pipeline stage, triggered by changes to that environment's directory or the modules it references. Production requires a manual approval gate before apply — not because automation cannot be trusted, but because a human checkpoint creates a forcing function for the team to confirm the plan before irreversible changes execute.
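
The trigger wiring is plain path filtering. A sketch for the prod pipeline in Azure DevOps (paths are illustrative):

trigger:
  branches:
    include:
      - main
  paths:
    include:
      - environments/prod
      - modules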

Pipeline architecture per environment (the GitOps deployment flow):

Pull Request stage
  • terraform fmt -check
  • terraform validate
  • terraform plan (saved as artefact)
  • tfsec / Trivy security scan
  • Plan output posted to PR comment
Dev auto-apply (on merge to main)
  • Triggered: changes under environments/dev/ or modules/
  • terraform apply -auto-approve
  • Post-apply smoke tests
  • Notify team channel on failure
Prod gated apply
  • Triggered: changes under environments/prod/
  • terraform plan (re-run for freshness)
  • Manual approval required
  • terraform apply with plan file
  • Drift detection scheduled daily

Staging sits between dev and prod with the same auto-apply pattern as dev but triggered independently. The pipeline never runs `apply` across environments in a single execution — each environment is a distinct pipeline run with its own identity, permissions, and audit log.
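
In Azure DevOps, the manual gate is not expressed in YAML directly: the apply runs in a deployment job targeting an environment that has an approval check configured on it. A sketch (environment and plan file names are illustrative; init and plan-artefact download steps omitted):

- stage: ProdApply
  jobs:
    - deployment: ApplyProd
      environment: azure-prod   # approval check configured on this environment
      strategy:
        runOnce:
          deploy:
            steps:
              - script: terraform apply prod.tfplan
                workingDirectory: environments/prod
                displayName: Apply reviewed plan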

Drift detection and state hygiene

Even with strict IaC discipline, drift happens. A security team applies a mandatory policy assignment via the portal. An engineer adjusts a firewall rule to recover from an incident and forgets to replicate the change in code. A third-party integration creates resources in your subscription.

A scheduled drift detection pipeline runs `terraform plan` daily against each environment and posts a summary to a monitoring channel. No apply happens — this is purely observational. If the plan is non-empty, it becomes a team action item. This turns drift from a silent accumulating risk into a visible, tracked signal.

schedules:
  - cron: "0 6 * * 1-5"       # 06:00 UTC, weekdays
    displayName: Daily drift detection
    branches:
      include:
        - main
    always: true               # run even if no code change

stages:
  - stage: DriftCheck
    jobs:
      - job: ProdDrift
        steps:
          - task: TerraformTaskV4@4
            name: prodPlan
            inputs:
              provider: azurerm
              command: plan
              workingDirectory: environments/prod
              environmentServiceNameAzureRM: sc-azure-prod-oidc
              commandOptions: -detailed-exitcode -out=drift.tfplan

          # terraform plan -detailed-exitcode exits 0 (no changes), 1 (error)
          # or 2 (changes present); the task surfaces exit code 2 through its
          # changesPresent output variable rather than failing the step.
          - script: |
              echo "##vso[task.logissue type=warning]Drift detected in prod"
              echo "##vso[task.setvariable variable=DriftFound]true"
            displayName: Flag drift
            condition: eq(variables['prodPlan.changesPresent'], 'true')

          - task: PostToTeams@1
            condition: eq(variables['DriftFound'], 'true')
            inputs:
              webhookUrl: $(TEAMS_WEBHOOK_URL)
              message: "⚠️ Terraform drift detected in prod — review plan artefact"

What this looks like at scale

The structure described here scales to dozens of environments and hundreds of resources without requiring significant changes. New environments — a canary environment, a pen-test environment, a customer-specific isolated environment — are added by creating a new directory under `environments/`, copying the backend and variable structure, and wiring a pipeline stage. The modules are unchanged.

Before:
  • Single Terraform directory for all envs
  • Workspace-based separation
  • Manual terraform apply from developer laptops
  • Secrets in .tfvars files
  • No drift detection
  • Environment branches drifting from each other
  • Incident: prod apply from wrong context

After:
  • Directory-per-environment isolation
  • Independent state files per environment
  • All applies via pipeline with OIDC identity
  • Secrets from Key Vault or pipeline vars only
  • Scheduled drift detection with team alerting
  • Single main branch, environment config in directories
  • Blast radius limited by structure, not process

The investment is a few days of restructuring and pipeline work. The return is a codebase where the blast radius of any change is structurally bounded, where promotions are auditable git operations, and where the team can move fast on dev without ever being one wrong command away from a production incident.

This is the type of platform engineering work our Cloud Platform and Governance use cases cover. If your team is running Terraform in a way that is starting to feel fragile, or you are building from scratch and want to avoid the re-architecture later, this is exactly the kind of engagement our Builders team delivers.

Need this for your project?

We cover this exact scenario. Strategy, delivery, or both. See our services or get in touch.