Azure Private Endpoint: detect Terraform, DNS, and network drift before an incident

Build an operational view of drift across Terraform, Private Endpoints, private DNS, CI runners, and validation evidence before a private Azure path breaks in production.

03 Jun 2026 · azure, private-endpoint, dns, terraform, drift, monitoring, runbook, automation

A Private Endpoint rarely fails at creation time. It usually fails later: when a private DNS zone is moved, when a Terraform runner moves to another network, when a VNet link disappears, or when a leftover public exception hides a bad resolution. The symptom then appears on the application side: a connection error, a blocked deployment, an unreachable backend, a Function that no longer starts, or a terraform plan failure even though the most recent change does not look directly related.

The scenario here is an Azure platform where several services are exposed privately: Storage Account, Key Vault, SQL, App Service, Azure Functions, or internal APIs. Terraform creates some of the resources, but day-to-day operations sometimes change the surrounding paths: DNS hub, forwarders, network rules, Private Endpoint subnets, CI runners, maintenance exceptions. The objective is not only to detect drift, but to decide whether it threatens a real production path.

Define what Terraform must guarantee

Terraform should not be treated as complete proof of runtime behavior. It describes the expected resources, but it does not prove that the consumer resolves the right name, goes through the right resolver, runs on the right runner, or is correctly refused from an unauthorized network. Separate three things: declared state, real Azure state, and operational evidence.

text private-endpoint-drift-scope.txt
Declared state
  - Expected Private Endpoint
  - Expected Private DNS Zone
  - Expected DNS zone group
  - Expected VNet link
  - Expected public access: disabled or restricted

Real state
  - Endpoint approved
  - Private IP assigned
  - DNS zone really linked
  - Records present
  - Effective network rules

Operational evidence
  - Resolution from the workload
  - Resolution from the CI runner
  - HTTPS or TCP access validated
  - Negative test from an unauthorized network
  - Rollback documented

This separation avoids a common trap: deciding that a service is private because Terraform shows a Private Endpoint while the workload still uses a public path or a different resolver.
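
The three layers can be combined into a single verdict. The sketch below is only an illustration: the function name `private_path_verdict` and the labels it prints are hypothetical, and each argument is assumed to be the result ("ok" or anything else) of checks the team has already run for that layer.

```bash
#!/usr/bin/env bash
# Hypothetical helper: combine the three layers into one verdict.
# Arguments: declared, real, evidence. Each is "ok" or any other value (treated as not ok).
private_path_verdict() {
  local declared="$1" real="$2" evidence="$3"
  if [ "$declared" != "ok" ]; then
    echo "not-declared"      # Terraform does not describe the private path
  elif [ "$real" != "ok" ]; then
    echo "declared-only"     # declared in Terraform, but Azure state disagrees
  elif [ "$evidence" != "ok" ]; then
    echo "unproven"          # resources exist, but no test from the real consumer path
  else
    echo "proven-private"
  fi
}
```

For example, `private_path_verdict ok ok missing` prints `unproven`, which is exactly the trap described above: the endpoint exists, yet nothing shows that the workload uses it.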

List consumer paths before diagnosis

The same private service may be consumed through several paths. An application with VNet integration, a CI runner, an administration VM, and a support workstation do not necessarily use the same DNS. Diagnosis should therefore start with a simple matrix.

text consumer-paths.txt
Service: stordersprod.blob.core.windows.net

Consumers to validate
  - Production Function App: VNet integration vnet-app-prod
  - Terraform runner: subnet ci-runners-prod
  - Ops bastion VM: vnet-hub-ops
  - Admin workstation: VPN to hub

For each consumer
  - Resolver used
  - Expected DNS result
  - Expected private IP
  - Allowed flow
  - Expected refusal from an external network

Without this matrix, the team may test from the easiest place, not from the place that actually represents the workload. A successful nslookup from a hub VM does not prove that a Function, App Service, or CI runner resolves in the same way.
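
One way to keep the matrix honest is to store it as data and list the consumers that have no recorded evidence yet. This is a minimal sketch under assumptions: the pipe-separated row format (consumer, resolver, expected IP prefix, date of last evidence) and the function name `consumers_missing_evidence` are invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical row format, one consumer per line:
#   consumer|resolver|expected_ip_prefix|last_evidence_date
# Prints the consumers whose last_evidence_date field is empty.
consumers_missing_evidence() {
  awk -F'|' 'NF >= 4 && $1 !~ /^#/ && $4 == "" { print $1 }'
}

# Example usage (all values are illustrative):
#   printf '%s\n' \
#     'function-app|168.63.129.16|10.50.20.|2026-06-01' \
#     'terraform-runner|10.10.0.4|10.50.20.|' \
#   | consumers_missing_evidence
```

With the two example rows, only `terraform-runner` is printed, because its evidence field is empty.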

Compare Terraform and Azure without confusing drift and incident

Drift is not automatically an incident. It may be intentional, temporary, or without immediate impact. It becomes critical when it touches a component the private path depends on: a deleted endpoint, a missing zone group, a removed VNet link, reopened public access, or a private IP that no longer matches the DNS record in use.
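
This triage rule can be written down so that every reviewer applies it the same way. The sketch below assumes drift findings have already been reduced to short labels; the labels and the function name `drift_severity` are hypothetical and only mirror the critical cases listed above.

```bash
#!/usr/bin/env bash
# Hypothetical triage: map a drift label to a severity.
# Only the cases named in the text are treated as critical; everything else is reviewed.
drift_severity() {
  case "$1" in
    endpoint-deleted|zone-group-missing|vnet-link-removed|public-access-reopened|ip-record-mismatch)
      echo "critical" ;;
    *)
      echo "review" ;;
  esac
}
```

A tag change on the endpoint, for instance, would fall through to `review` rather than trigger an incident.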

bash 01-terraform-azure-drift-readout.sh
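# Refresh-only plan: compares the recorded state with real Azure resources and proposes state updates only
terraform plan -refresh-only -out=tfplan.refresh
# The ',+80' address range is a GNU sed extension
terraform show -no-color tfplan.refresh | sed -n '/private_endpoint/,+80p'

# Endpoint as Azure sees it, including approval status
# (manually approved connections appear under manualPrivateLinkServiceConnections instead)
az network private-endpoint show -g rg-network-prod -n pe-stordersprod-blob --query '{name:name, provisioningState:provisioningState, connectionStatus:privateLinkServiceConnections[0].privateLinkServiceConnectionState.status, subnet:subnet.id, ip:customDnsConfigs[0].ipAddresses[0]}'

# Private DNS zones attached through the zone group
az network private-endpoint dns-zone-group list -g rg-network-prod --endpoint-name pe-stordersprod-blob --query '[].{name:name, zones:privateDnsZoneConfigs[].privateDnsZoneId}'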

Read the results layer by layer. A refresh-only plan can report a configuration difference, but only Azure can confirm whether the endpoint is approved, whether the zone is attached, and whether the expected private IP still exists.

Verify resolution from useful paths

The DNS test must be repeatable and named. It should say where it runs, which FQDN is tested, which CNAME chain is expected, and which private IP is acceptable. Otherwise, diagnosis becomes a pile of screenshots that are hard to compare.

bash 02-check-private-dns-from-paths.sh
SERVICE_FQDN="stordersprod.blob.core.windows.net"
EXPECTED_SUFFIX="privatelink.blob.core.windows.net"
EXPECTED_PREFIX="10.50.20."

nslookup "$SERVICE_FQDN"
dig +short CNAME "$SERVICE_FQDN"
dig +short "$SERVICE_FQDN"

# Expected reading
# - the CNAME chain goes through privatelink.blob.core.windows.net
# - the final address starts with 10.50.20.
# - the test is replayed from the workload and from the CI runner

A public result does not always cause an immediate outage if public access is still open. That is precisely the risk: the platform keeps working while the evidence for the private path is already invalid. The next network hardening step will then turn silent drift into an incident.
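
The expected reading can be turned into a small classifier so that the same rule is applied on every path. This is a sketch, not a definitive check: `classify_resolution` and its output labels are hypothetical, and it reuses the `EXPECTED_SUFFIX` and `EXPECTED_PREFIX` values from the script above as defaults.

```bash
#!/usr/bin/env bash
# Hypothetical classifier for one DNS answer.
# $1: CNAME chain as text, $2: final IPv4 address (may be empty).
classify_resolution() {
  local cname_chain="$1" final_ip="$2"
  local expected_suffix="${EXPECTED_SUFFIX:-privatelink.blob.core.windows.net}"
  local expected_prefix="${EXPECTED_PREFIX:-10.50.20.}"
  case "$final_ip" in
    "$expected_prefix"*)
      case "$cname_chain" in
        *"$expected_suffix"*) echo "private" ;;
        *) echo "private-ip-unexpected-chain" ;;
      esac ;;
    "") echo "no-answer" ;;
    # RFC 1918 ranges outside the expected prefix
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo "private-wrong-range" ;;
    *) echo "public" ;;
  esac
}

# Example usage from a given path (runs real DNS queries):
#   chain="$(dig +short CNAME "$SERVICE_FQDN")"
#   ip="$(dig +short A "$SERVICE_FQDN" | grep -E '^[0-9.]+$' | tail -n 1)"
#   classify_resolution "$chain" "$ip"
```

Any result other than `private` deserves a look, even while the application still works through a public path.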

Check public exceptions and residual paths

The most deceptive drift is the one that leaves everything working. A firewall rule, a trusted service exception, or temporary public access can hide a DNS error. Before closing public access or changing a Private Endpoint, verify that success comes from the intended path.

bash 03-check-public-access-residue.sh
az storage account show -g rg-app-prod -n stordersprod --query '{publicNetworkAccess:publicNetworkAccess, defaultAction:networkRuleSet.defaultAction, bypass:networkRuleSet.bypass}'

az storage account network-rule list -g rg-app-prod -n stordersprod --query '{ipRules:ipRules[].ipAddressOrRange, virtualNetworkRules:virtualNetworkRules[].virtualNetworkResourceId}'

The right question is not only "does it work?" but "through which path does it work?". A successful connection from an unauthorized network should be treated as drift, even if the application is not complaining yet.
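
The output of the two commands above can be reduced to a single exposure label. The sketch below is a simplification under stated assumptions: `public_exposure` and its labels are hypothetical, it only looks at `publicNetworkAccess`, `defaultAction`, and the number of IP rules, and it ignores virtual network rules and the bypass setting, which also need a manual look.

```bash
#!/usr/bin/env bash
# Hypothetical summary of residual public exposure for a storage account.
# $1: publicNetworkAccess (e.g. Enabled or Disabled)
# $2: networkRuleSet.defaultAction (Allow or Deny)
# $3: number of IP rules (defaults to 0)
public_exposure() {
  local public_access="$1" default_action="$2" ip_rule_count="${3:-0}"
  if [ "$public_access" = "Disabled" ]; then
    echo "closed"
  elif [ "$default_action" = "Deny" ] && [ "$ip_rule_count" -eq 0 ]; then
    echo "restricted-no-ip-rules"
  elif [ "$default_action" = "Deny" ]; then
    echo "restricted-with-exceptions"
  else
    echo "open"
  fi
}
```

A result of `open` or `restricted-with-exceptions` means a successful application test proves nothing yet about the private path.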

Put the Terraform runner in scope

Teams often validate resolution from the workload and then forget the Terraform runner. Yet if the backend, providers, or validation scripts need to reach private services, the runner becomes a consumer. A subnet or DNS change on runners can block init, plan, apply, or post-deployment checks.

text terraform-runner-contract.txt
Terraform runner contract
  - Known execution network
  - Documented DNS resolver
  - State backend access validated
  - Access to private services required by tests validated
  - CI identity separated from human accounts
  - Maintenance procedure if the primary runner is unavailable

This contract prevents every CI failure from being treated as a Terraform issue. Sometimes the code did not change: the runner's network path is simply no longer the one on which earlier evidence was collected.
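
The "documented DNS resolver" line of the contract can be checked at the start of a pipeline run. This is a minimal sketch assuming a Linux runner whose resolvers are listed in `/etc/resolv.conf`; the function names are hypothetical, and on hosts that use a local stub resolver such as systemd-resolved the file may list only the stub address.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: compare the runner's resolvers with the documented ones.

# Print the nameserver entries of a resolv.conf-style file as a sorted, comma-separated list.
resolvers_in() {
  awk '$1 == "nameserver" { print $2 }' "$1" | sort -u | paste -sd, -
}

# $1: documented resolver list (sorted, comma-separated)
# $2: file to read (defaults to /etc/resolv.conf)
runner_resolver_status() {
  local expected="$1" resolv_file="${2:-/etc/resolv.conf}"
  local actual
  actual="$(resolvers_in "$resolv_file")"
  if [ "$actual" = "$expected" ]; then
    echo "unchanged"
  else
    echo "changed: expected=$expected actual=$actual"
  fi
}
```

Running `runner_resolver_status "10.10.0.4,10.10.0.5"` before `terraform init` (the addresses are illustrative) makes a resolver change visible before it shows up as an unexplained backend error.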

Turn diagnosis into a change guardrail

The useful check is not a yearly audit. It should become a short step before any change that touches a Private Endpoint, a Private DNS Zone, routing, a firewall, a CI runner, or public access. The goal is to block ambiguous changes, not to slow every deployment.

text private-path-change-gate.txt
Before change
  - Identify affected FQDNs
  - List critical consumers
  - Capture DNS and access from workload + runner
  - Check Terraform refresh-only state
  - Check residual public exceptions

After change
  - Replay the same tests from the same paths
  - Confirm negative test from an external network
  - Document private IP, zone, VNet link, and application result
  - Keep DNS/network rollback explicit

The guardrail must stay short enough to be used. Exhaustive tests end up ignored. The five or six proofs that really protect the private path should be easy to replay.
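
Replaying the same tests is easier when the evidence is stored in a comparable form. The sketch below assumes one line of evidence per path in a plain file and assumes the change is not meant to alter resolved IPs or access results; `record_evidence`, `evidence_status`, and the line format are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical evidence format, one line per tested path:
#   path-name|fqdn|resolved-ip|access-result

# Append one evidence line to a file.
# usage: record_evidence <file> <path-name> <fqdn> <resolved-ip> <access-result>
record_evidence() {
  printf '%s|%s|%s|%s\n' "$2" "$3" "$4" "$5" >> "$1"
}

# Compare evidence captured before and after a change, ignoring line order.
evidence_status() {
  local before="$1" after="$2"
  if [ "$(sort "$before")" = "$(sort "$after")" ]; then
    echo "unchanged"
  else
    echo "changed"
  fi
}
```

When `evidence_status` prints `changed`, a plain `diff` of the two files shows which path moved, and that difference becomes the first item to explain before the change is closed.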

Plan DNS rollback before touching the network

Private Endpoint rollback does not always mean deleting the resource. In many cases, the cleanest return path is restoring a zone link, reverting a forwarder, adding a limited temporary network rule, or moving a runner back to its previous network. This decision should be written before the change window.

text private-dns-rollback.txt
Possible rollback
  - Restore previous VNet link
  - Revert to previous DNS forwarder
  - Temporarily reactivate a limited network rule
  - Move the Terraform runner back to the validated subnet
  - Replay DNS tests from workload and runner
  - Remove temporary exception after stabilization

Without explicit rollback, the team may reopen public access too broadly to restore service. The return path should therefore be as targeted as the diagnosis.

Conclusion

Drift around a Private Endpoint is rarely a single Terraform difference. It is usually a gap between the declared resource, the real Azure state, and the proof collected from the right paths. To keep a private network operable, connect Terraform, private DNS, CI runners, public exceptions, and application tests.

A sound baseline is to document consumers, verify the DNS chain from both the workload and the runner, compare Terraform state with Azure state, and test refusal from unauthorized paths. With this evidence, a Private Endpoint change becomes a controlled decision rather than a bet on an architecture that is only assumed to be private.