Cloud Infrastructure · March 25, 2025 · 15 min read

Terraform Patterns for Multi-Region Databricks on AWS

Practical Terraform and Terragrunt patterns for deploying Databricks across multiple AWS regions. We cover the module hierarchy, Unity Catalog cross-region considerations, IAM role patterns, and the CI/CD pipeline that takes infrastructure provisioning from weeks to hours.

Danylo Dudok
Principal Architect, Sparkvern
Terraform · Terragrunt · AWS · Databricks · Unity Catalog

Deploying Databricks in a single AWS region with a single workspace is straightforward. The Databricks Terraform provider has good documentation, and you can have a working workspace in an afternoon. But the moment you need multiple regions (for data residency, disaster recovery, or latency optimization), multiple environments (development, staging, production), and consistent governance across all of them, the complexity escalates rapidly.

This is the story of how we built the infrastructure layer for a European regulatory technology platform that required Databricks workspaces across two AWS regions, three environments per region, with Unity Catalog governance spanning the entire deployment. We used Terraform for the modules and Terragrunt for the DRY composition layer that makes multi-region, multi-environment deployment manageable.

Why Terragrunt

Before we get into the modules themselves, it is worth explaining why we chose Terragrunt rather than plain Terraform with workspaces or a custom wrapper.

Terraform workspaces are the built-in mechanism for managing multiple instances of the same configuration, but they share a single state backend configuration and do not support varying the provider configuration per workspace. When you need different AWS regions per environment, or different Databricks workspace URLs, workspaces fall short.

Terragrunt solves this by providing a composition layer on top of Terraform modules. Each “unit” in the Terragrunt hierarchy is a thin wrapper that specifies which Terraform module to use, what inputs to pass, and where to store the state. The Terraform modules themselves remain clean, reusable, and environment-agnostic.

The alternative — writing a custom Python or Bash wrapper to manage multiple Terraform runs — is a path we have walked on previous projects and regretted. Custom wrappers accumulate edge cases, break when Terraform changes its CLI interface, and lack the dependency management that Terragrunt provides out of the box.

The Terragrunt Hierarchy

Our directory structure follows a four-level hierarchy: account, region, environment, and component. This maps naturally to the organizational reality of the deployment.

At the top level, we have the account directory, which contains the Terragrunt root configuration file (terragrunt.hcl). This file defines the S3 backend configuration for state storage and the DynamoDB table for state locking. It also defines account-wide variables like the AWS account ID, the organization name, and global tags applied to every resource.
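A root configuration along these lines illustrates the idea (the bucket, lock table, and account ID below are placeholders, not the project's actual values):

```hcl
# Root terragrunt.hcl at the account level.
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "acme-terraform-state"   # assumed bucket name
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-central-1"
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"   # assumed lock table name
  }
}

# Account-wide inputs inherited by every unit below this file.
inputs = {
  aws_account_id = "123456789012"             # placeholder
  organization   = "acme"
  global_tags = {
    ManagedBy = "terraform"
    Owner     = "platform-engineering"
  }
}
```

The `path_relative_to_include()` function makes each unit's state key mirror its position in the directory tree, which is what gives every component its own state file.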

Below the account level are region directories (eu-central-1 for Frankfurt, eu-west-1 for Ireland). Each region directory has its own terragrunt.hcl that defines region-specific variables: the AWS region, the availability zones to use, and the CIDR blocks for the region’s VPC.

Below each region are environment directories (dev, staging, prod). Each environment’s terragrunt.hcl defines environment-specific variables: the environment name (used in resource naming), the Databricks workspace tier (Standard for dev, Premium for staging and prod), cluster size defaults, and auto-termination policies.

At the bottom are component directories. Each component corresponds to a Terraform module deployment: the network module, the Databricks workspace module, the Unity Catalog module, the cluster policy module, the secret scope module. Each component’s terragrunt.hcl specifies the source Terraform module, the input variables (composed from the variables defined at higher levels), and any dependencies on other components.
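A component unit, sketched with an assumed module repository and input names, ties these pieces together:

```hcl
# eu-central-1/prod/workspace/terragrunt.hcl (paths and names are illustrative)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  # Pin the module version so environments upgrade deliberately.
  source = "git::https://example.com/modules.git//databricks-workspace?ref=v1.4.0"
}

# Terragrunt reads the network component's outputs and orders the applies.
dependency "network" {
  config_path = "../network"
}

inputs = {
  workspace_name     = "analytics-prod"
  vpc_id             = dependency.network.outputs.vpc_id
  private_subnet_ids = dependency.network.outputs.private_subnet_ids
}
```

The `dependency` block is what gives Terragrunt its dependency graph: plans and applies run in topological order, and downstream units consume upstream outputs without any manual wiring.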

This hierarchy means that adding a new region requires creating a new region directory and populating it with environment and component configurations. The Terraform modules do not change. Adding a new environment within an existing region requires creating a new environment directory. Again, no module changes. This separation of structure from logic is the key benefit of the Terragrunt approach.

Key Terraform Modules

We developed a library of Terraform modules specifically for Databricks on AWS. Each module encapsulates a discrete concern and exposes a clean interface.

Workspace Provisioning Module

The workspace module handles the creation of a Databricks workspace, including the cross-account IAM role that Databricks uses to manage resources in the customer’s AWS account, the S3 bucket for the workspace’s root storage (DBFS), the workspace configuration via the Databricks MWS (Multi-Workspace Service) API, and the credential and storage configurations that link the workspace to AWS resources.

The module accepts inputs for the workspace name, region, VPC configuration (from the network module output), storage encryption key ARN (from a KMS module), and the Databricks account ID. It outputs the workspace URL, workspace ID, and the host URL needed by other modules to configure resources within the workspace.

A key design decision in this module is the separation of workspace creation from workspace configuration. The MWS API creates the workspace (an account-level operation), while subsequent Terraform resources configure it (workspace-level operations using a different provider alias). We handle this by defining two Databricks provider aliases in the module: one authenticated with account-level credentials for MWS operations, and another authenticated with workspace-level credentials for internal configuration. Terragrunt’s dependency mechanism ensures the workspace exists before any workspace-level configuration is attempted.
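A minimal sketch of the two-alias pattern, assuming OAuth service-principal authentication at the account level (variable names are ours, not the project's):

```hcl
# Account-level alias for MWS (workspace creation) operations.
provider "databricks" {
  alias         = "mws"
  host          = "https://accounts.cloud.databricks.com"
  account_id    = var.databricks_account_id
  client_id     = var.account_client_id
  client_secret = var.account_client_secret
}

# Workspace-level alias, pointed at the workspace once it exists.
provider "databricks" {
  alias = "workspace"
  host  = databricks_mws_workspaces.this.workspace_url
}

resource "databricks_mws_workspaces" "this" {
  provider       = databricks.mws
  account_id     = var.databricks_account_id
  workspace_name = var.workspace_name
  aws_region     = var.region
  # credentials_id, storage_configuration_id, and network_id omitted for brevity.
}
```

Workspace-level resources then select `provider = databricks.workspace`, so the account-level credentials are never used for anything beyond MWS operations.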

Unity Catalog Module

Unity Catalog is Databricks’ governance layer, and its multi-region behavior requires careful planning.

The module manages the metastore (one per region — this is a hard constraint from Databricks), catalogs within the metastore (typically one per environment: dev, staging, prod), schemas within each catalog, and grants that control access at each level.

The metastore is the most region-sensitive component. A Unity Catalog metastore is bound to a single AWS region, and each Databricks workspace can be attached to exactly one metastore. In our two-region deployment, this means two metastores, each serving the workspaces in its respective region.

The module accepts a catalog structure definition as a map of catalog names to their schema lists. For example, the production catalog might have schemas for bronze, silver, gold, and ml_features. The dev catalog might mirror this structure but with a separate sandbox schema for experimentation.

Grants are defined declaratively in the module inputs, specifying which groups (mapped from the identity provider) receive which permissions at which level. We follow a principle of least privilege: data engineering groups get read/write on bronze through gold schemas, data science groups get read on gold and read/write on ml_features, and analytics groups get read-only on gold.
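The input shape and resources might look like the following sketch (the metastore resource is assumed to be defined elsewhere in the module; group and schema names are illustrative):

```hcl
variable "catalogs" {
  # Catalog name -> list of schema names.
  type = map(list(string))
  default = {
    prod = ["bronze", "silver", "gold", "ml_features"]
    dev  = ["bronze", "silver", "gold", "ml_features", "sandbox"]
  }
}

resource "databricks_catalog" "this" {
  for_each     = var.catalogs
  metastore_id = databricks_metastore.this.id
  name         = each.key
}

resource "databricks_schema" "this" {
  # Flatten the map into one entry per catalog.schema pair.
  for_each = { for pair in flatten([
    for c, schemas in var.catalogs : [for s in schemas : { catalog = c, schema = s }]
  ]) : "${pair.catalog}.${pair.schema}" => pair }

  catalog_name = databricks_catalog.this[each.value.catalog].name
  name         = each.value.schema
}

# Example grant: analytics gets read-only access to the prod gold schema.
resource "databricks_grants" "gold_readonly" {
  schema = databricks_schema.this["prod.gold"].id
  grant {
    principal  = "analytics"
    privileges = ["USE_SCHEMA", "SELECT"]
  }
}
```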

Cluster Policy Module

Cluster policies in Databricks control what kinds of compute resources users can create. In a multi-environment setup, policies differ significantly: development environments allow small, short-lived clusters for experimentation; production environments enforce specific instance types, auto-scaling ranges, and Spark configurations.

The module defines cluster policies as JSON templates with environment-specific overrides. A base policy sets the common constraints (approved instance type families, maximum auto-termination timeout, required Spark configurations like Delta optimization settings). Environment overlays adjust the specific values. Production policies enforce larger minimum cluster sizes, stricter instance type allowlists, and longer auto-termination windows to avoid unnecessary restarts during long-running jobs.
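A condensed sketch of the base-plus-overlay merge, with illustrative constraint values:

```hcl
locals {
  base_policy = {
    "autotermination_minutes" = { "type" = "range", "maxValue" = 120 }
    "node_type_id"            = { "type" = "allowlist", "values" = ["m5.xlarge", "m5.2xlarge"] }
    "spark_conf.spark.databricks.delta.optimizeWrite.enabled" = {
      "type" = "fixed", "value" = "true"
    }
  }
  # Production allows longer auto-termination for long-running jobs.
  prod_overrides = {
    "autotermination_minutes" = { "type" = "range", "maxValue" = 360 }
  }
}

resource "databricks_cluster_policy" "this" {
  name = "${var.environment}-default"
  definition = jsonencode(merge(
    local.base_policy,
    var.environment == "prod" ? local.prod_overrides : {}
  ))
}

# Map the policy to the group that is allowed to use it.
resource "databricks_permissions" "policy" {
  cluster_policy_id = databricks_cluster_policy.this.id
  access_control {
    group_name       = "data-engineers"
    permission_level = "CAN_USE"
  }
}
```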

Each policy is associated with a group, and the module creates the group-to-policy mappings. This ensures that when a data engineer creates a cluster, they can only select from configurations that are approved for their environment and role.

Secret Scope Module

Databricks secret scopes store sensitive values (API keys, database credentials, connection strings) that notebooks and jobs reference at runtime. The module manages secret scopes backed by AWS Secrets Manager, creating the scope, populating it with references to Secrets Manager entries, and configuring access control lists.

The important pattern here is that the Terraform module never handles the secret values themselves. The module creates the secret scope and the ACLs, but the actual secret values are managed through a separate process (a manual upload to Secrets Manager or a separate CI/CD pipeline with restricted access). This separation ensures that Terraform state files — which are stored in S3 and accessible to infrastructure engineers — never contain production credentials.
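The Terraform-managed portion reduces to something like this sketch (scope and group names are assumptions); note that no secret value appears anywhere in the configuration or state:

```hcl
resource "databricks_secret_scope" "app" {
  name = "application-secrets"
}

# Grant read-only access to the scope; values are populated out of band
# from AWS Secrets Manager by a separate, restricted process.
resource "databricks_secret_acl" "readers" {
  scope      = databricks_secret_scope.app.name
  principal  = "data-engineers"
  permission = "READ"
}
```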

Network Module

The network module provisions the VPC, subnets, NAT gateways, route tables, and security groups required by Databricks workspaces. It also configures VPC peering (for cross-VPC access to data sources) and AWS PrivateLink endpoints (for secure connectivity to the Databricks control plane and to AWS services like S3 and STS).

Databricks on AWS has specific networking requirements: two private subnets in different availability zones for the data plane, no public IP addresses on data plane instances, and egress through NAT gateway or VPC endpoints. The module encodes all of these requirements, so consumers of the module do not need to remember them.

PrivateLink configuration deserves specific mention because it is both important and fiddly. The module creates VPC endpoints for the Databricks workspace (REST API), the Databricks SCC relay (secure cluster connectivity), S3 (gateway endpoint for DBFS access), STS (for IAM role assumption), and Kinesis (if streaming workloads are in use). Each endpoint requires its own security group, and the security groups must allow traffic from the Databricks data plane subnets.
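The shape of the endpoint resources, sketched with assumed resource names (the Databricks workspace service name is region-specific and comes from the Databricks PrivateLink documentation):

```hcl
# S3 is a gateway endpoint attached to the private route tables.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.this.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

# STS is an interface endpoint with private DNS enabled.
resource "aws_vpc_endpoint" "sts" {
  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.sts"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

# Databricks workspace (REST API) endpoint; DNS is handled separately
# via a Route 53 Private Hosted Zone.
resource "aws_vpc_endpoint" "databricks_rest" {
  vpc_id              = aws_vpc.this.id
  service_name        = var.databricks_workspace_service_name
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = false
}
```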

We initially underestimated the PrivateLink configuration effort. It accounts for roughly 30% of the network module’s code, mostly because of the security group rules and the DNS configuration required for each endpoint. But the security benefit is substantial: with PrivateLink, none of the data plane traffic traverses the public internet.

IAM Role Patterns

AWS IAM is the other major area of complexity. A Databricks deployment involves several IAM roles with distinct trust relationships and permission boundaries.

The cross-account role is assumed by the Databricks control plane to manage resources in the customer’s AWS account. It needs permissions to create and manage EC2 instances, EBS volumes, security groups, and VPC resources within the data plane VPC. The module defines this role with a trust policy that allows the Databricks AWS account to assume it, scoped to the specific Databricks account ID.
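The trust policy portion can be sketched as follows, assuming the Databricks control plane's AWS account ID is supplied as a variable (per the Databricks documentation) and the workspace's Databricks account ID serves as the external ID:

```hcl
resource "aws_iam_role" "databricks_cross_account" {
  name = "databricks-cross-account"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRole"
      # Only the Databricks control plane account may assume this role.
      Principal = { AWS = "arn:aws:iam::${var.databricks_aws_account_id}:root" }
      # Scoped further to this specific Databricks account.
      Condition = {
        StringEquals = { "sts:ExternalId" = var.databricks_account_id }
      }
    }]
  })
}
```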

Instance profiles are attached to the EC2 instances that run Databricks clusters. These profiles grant the clusters access to S3 buckets (for reading data and writing results), Secrets Manager (for retrieving credentials at runtime), and any other AWS services the workloads require. We create separate instance profiles for different use cases: a base profile with S3 read access for general analytics, an elevated profile with S3 write access for ETL jobs, and a restricted profile with access only to specific buckets for sensitive workloads.

Cross-account access roles are used when Databricks workloads need to access resources in other AWS accounts — for example, reading data from a data lake in a central data account. These roles use a trust policy that allows the Databricks instance profile role to assume them, creating a role-chaining pattern.
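The role in the central data account carries a trust policy like this sketch (the role name and variable are placeholders):

```hcl
# Defined in the central data account: trusts the Databricks instance
# profile role from the workspace account, enabling role chaining.
resource "aws_iam_role" "data_lake_access" {
  name = "data-lake-reader"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = var.instance_profile_role_arn }
    }]
  })
}
```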

All of these roles are defined in Terraform modules with minimal permissions (least privilege) and with explicit boundary policies that prevent privilege escalation. The modules also create the corresponding Databricks instance profile resources that register these IAM roles with the workspace.

State Management

Terraform state management for a multi-region, multi-environment deployment requires careful organization. We use S3 as the state backend with DynamoDB for locking, configured at the account level in the Terragrunt root configuration.

Each component in each environment in each region gets its own state file, stored in a path that mirrors the Terragrunt directory hierarchy. For example, the network module for production in Frankfurt stores its state at a path like &lt;account&gt;/eu-central-1/prod/network/terraform.tfstate.

This granular state separation has two benefits. First, it limits the blast radius of state corruption or misconfiguration. A corrupted state file for the dev network in Ireland does not affect the production workspace in Frankfurt. Second, it enables parallel operations — different team members can apply changes to different components simultaneously without state locking conflicts.

State files are encrypted at rest (S3 SSE-KMS) and access is controlled through S3 bucket policies that restrict access to the infrastructure CI/CD pipeline’s IAM role. No human has direct access to state files; all operations go through the CI/CD pipeline.

CI/CD Pipeline

The infrastructure CI/CD pipeline follows the plan-on-PR, apply-on-merge pattern.

When an engineer opens a pull request that modifies any Terragrunt or Terraform files, the CI pipeline runs terragrunt plan for every affected component (Terragrunt’s dependency detection identifies which components are affected by the change). The plan output is posted as a comment on the pull request, allowing reviewers to see exactly what resources will be created, modified, or destroyed.

After code review and approval, merging to the main branch triggers the CD pipeline, which runs terragrunt apply for the affected components in dependency order. The pipeline applies changes to dev first, waits for health checks, then applies to staging, waits again, and finally applies to prod. Each stage requires manual approval through the CI/CD platform’s gate mechanism.

For destructive changes (resource deletion, replacement), the pipeline requires an additional approval from a senior engineer. This is enforced by a policy-as-code check that inspects the plan output for destroy or replace operations and escalates the approval requirement accordingly.

Drift Detection and Compliance

Infrastructure drift — resources modified outside of Terraform — is a reality in any organization. We run a scheduled drift detection job that executes terragrunt plan across all components nightly and reports any detected drift to a Slack channel.

Common sources of drift include manual changes made through the Databricks UI (someone adjusting a cluster policy for debugging), AWS resource modifications made through the console (a network engineer adding a security group rule for troubleshooting), and changes made by AWS services themselves (auto-scaling modifying instance counts, which then appear as drift in cluster configurations).

For each drift detection, the team evaluates whether the change should be incorporated into Terraform (updating the code to match the new desired state) or reverted (running terragrunt apply to restore the Terraform-defined state). The goal is not zero drift at all times — that is unrealistic — but rapid detection and resolution.

We also run a compliance check that verifies all deployed resources conform to organizational policies: encryption is enabled on all storage, public access is blocked on all S3 buckets, security groups do not allow unrestricted inbound access, and IAM roles have boundary policies attached. This check runs as part of the CI/CD pipeline (blocking non-compliant changes) and as a nightly scan (catching drift-introduced non-compliance).

Common Pitfalls

We learned several lessons the hard way during this engagement. Here are the most notable.

Unity Catalog metastore region binding: Each Unity Catalog metastore is permanently bound to a single AWS region. You cannot move a metastore to a different region after creation. Plan your metastore topology carefully before deploying, because changing it later requires recreating catalogs and migrating metadata. In our case, we needed two metastores (one per region), which means cross-region data sharing requires explicit grants through Unity Catalog’s sharing features rather than direct catalog access.

Cross-region latency for Unity Catalog operations: Unity Catalog metadata operations (listing schemas, checking grants, resolving table locations) add latency when the workspace and metastore are in different regions. This is not an issue in our deployment (each workspace connects to its regional metastore), but it would be a concern if you tried to use a single metastore for multiple regions. Do not do this.

Databricks provider authentication sequencing: The Databricks Terraform provider needs different credentials for account-level operations (MWS workspace creation) and workspace-level operations (creating clusters, policies, catalogs). If you try to configure workspace-level resources before the workspace is fully provisioned and accessible, Terraform will fail with authentication errors. Terragrunt’s dependency mechanism handles this, but you must explicitly declare the dependency — it is not inferred.

VPC endpoint DNS resolution: When using PrivateLink, the DNS resolution for Databricks endpoints must be configured to resolve to the private IP addresses of the VPC endpoints, not the public addresses. This requires Private Hosted Zones in Route 53. If DNS is misconfigured, the workspace appears to work (the control plane is reachable), but data plane operations fail silently or with opaque errors because the clusters cannot reach S3 or STS through the private path.

Terraform state file size: Unity Catalog resources (catalogs, schemas, grants) can produce large state files because each grant is a separate Terraform resource. In a deployment with multiple catalogs, many schemas, and fine-grained grants, the state file for the Unity Catalog module can grow to several megabytes, slowing plan and apply operations. We mitigate this by splitting the Unity Catalog module into sub-modules (metastore, catalogs, grants) with separate state files.

Instance profile propagation delay: When Terraform creates an IAM instance profile and immediately tries to register it with Databricks, the registration can fail because IAM changes take time to propagate across AWS. We added a deliberate delay (using the time_sleep resource) between instance profile creation and Databricks registration. Thirty seconds is usually sufficient, but we use sixty for safety.
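The pattern uses the time_sleep resource from the hashicorp/time provider; a sketch with assumed resource names:

```hcl
resource "aws_iam_instance_profile" "this" {
  name = "databricks-base"
  role = aws_iam_role.instance.name
}

# Wait out IAM eventual consistency before registering with Databricks.
resource "time_sleep" "iam_propagation" {
  depends_on      = [aws_iam_instance_profile.this]
  create_duration = "60s"
}

resource "databricks_instance_profile" "this" {
  depends_on           = [time_sleep.iam_propagation]
  instance_profile_arn = aws_iam_instance_profile.this.arn
}
```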

Results

The infrastructure layer we built delivers concrete benefits.

Provisioning time reduced from weeks to hours. Before this engagement, deploying a new Databricks environment was a multi-week project involving manual console operations, ad-hoc scripts, and extensive verification. Now, a new environment is a Terragrunt variable file and a CI/CD pipeline run.

Consistent governance across all environments. Every workspace has the same Unity Catalog configuration, the same cluster policies, the same network security posture. Environment-specific differences (cluster sizes, auto-termination policies) are intentional and documented in the variable files.

Auditable infrastructure changes. Every modification goes through a pull request with a plan review. The Git history provides a complete audit trail of what changed, when, why, and who approved it.

Repeatable disaster recovery. If a region fails, we can provision replacement infrastructure in a different region by running the Terragrunt pipeline with a new region configuration. The RTO for infrastructure (not data, which depends on replication) is under two hours.

The investment in Terraform modules and Terragrunt composition pays dividends every time a new requirement emerges. When the client requested a sandbox workspace for a data science proof-of-concept, we had it running in half a day — not because the requirement was simple, but because the tooling made it routine.

