Infrastructure

Azure managed identity: diagnose private access before changing permissions

Build a runbook for Key Vault, Storage or private API access failures with managed identity, RBAC, private DNS, logs and real execution evidence.

07 Jun 2026 | azure, identity, private-endpoint, dns, logs, kql, monitoring, automation, runbook, rollback

A managed identity rarely fails in isolation. When an Azure workload can no longer read a Key Vault secret, write to a Storage Account or call a protected internal API, diagnosis often jumps too quickly to RBAC: add a role, broaden a permission, restart the service. In production, that reaction can hide the real issue: wrong principal, access tested with an admin account, public DNS resolution, unreachable Private Endpoint, incomplete RBAC propagation or an application cache.

The use case is an application or automation running through a private Azure path: App Service with VNet Integration, Azure Function, VM, CI runner, AWX job or diagnostic container. It uses a managed identity to access Key Vault, Storage or a protected API. The goal of the runbook is not only to restore access. It must prove which identity really performs the call, which network path it uses, which control rejects the request and which fix stays bounded.

Start with the real execution identity

The first mistake is validating access from an operator workstation or Azure Cloud Shell, then assuming the workload has the same rights. A system-assigned managed identity is tied to a single resource, while a user-assigned identity is attached explicitly and can be shared by several resources. The runbook must identify the principal that performs the production call.

bash 01-managed-identity-context.sh
# From Azure context, identify identities assigned to the workload.
az webapp identity show --resource-group rg-prod-app --name app-private-api --query '{principalId:principalId, tenantId:tenantId, userAssigned:userAssignedIdentities}'

# For a VM, also check user-assigned identities.
az vm identity show --resource-group rg-prod-runners --name vm-runner-01 --query '{principalId:principalId, userAssigned:userAssignedIdentities}'

This prevents a misleading diagnosis. If the pipeline, the nightly worker and the web application do not use the same identity, a successful test on one of them proves nothing for the others.
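The Azure CLI output above shows which identities are assigned, not which one a given call actually presents. A minimal sketch for closing that gap on a VM, assuming `curl`, `sed` and a `base64` that accepts `-d` are available: request a token from the instance metadata service and read only its `oid` (object ID) and `tid` (tenant) claims. The helper names `jwt_payload`, `jwt_claim` and `fetch_imds_token` are illustrative, the token is decoded without signature verification, and App Service or Functions expose a different token endpoint that this sketch does not cover.

```shell
# Sketch: read identity claims from a managed identity token without printing the token.

# Decode the payload (second segment) of a JWT. No signature verification is done.
jwt_payload() {
  local seg
  seg="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # base64url drops padding; restore it before decoding.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# Extract one string claim (for example oid or tid) from the payload.
jwt_claim() {
  jwt_payload "$1" | tr -d '\n' \
    | sed -n "s/.*\"$2\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# VM case: ask the instance metadata service for a Key Vault token.
fetch_imds_token() {
  curl -s -H 'Metadata: true' \
    'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net' \
    | sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage on the VM itself (prints claims only, never the token):
#   token="$(fetch_imds_token)"
#   echo "oid=$(jwt_claim "$token" oid) tid=$(jwt_claim "$token" tid)"
```

If the `oid` printed here differs from the `principalId` returned by the CLI, the workload is probably presenting another identity, for example a user-assigned one selected through a client ID in the application configuration.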

Separate authorization, network and DNS

A Key Vault or Storage access failure can come from several layers. RBAC says whether the identity is allowed to act. The resource firewall says whether the source network is accepted. Private Endpoint and private DNS say whether the path reaches the private address. The application may still present a token for the wrong tenant or keep an old configuration.

The runbook should force that separation.

text managed-identity-access-split.txt
Question
Which identity issues the call?
Which RBAC role or access policy covers the action?
Does the name resolve to a private address from the workload?
Does the resource firewall accept that path?
Do logs show a network denial or identity denial?
Did the application reload its configuration and token?

Decision
Do not add a role until the network path is qualified
Do not change DNS until the real principal is confirmed
Do not revoke previous access until critical consumers are validated

This grid limits broad fixes. It also explains why an application can fail while the same secret is readable from an administration VM.
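Two of the questions in the grid can be answered with read-only CLI calls before anything is changed. The following is a sketch under assumptions: the function names are illustrative, the caller needs read access to the vault and to role assignments, and the property paths are those returned by `az keyvault show` at the time of writing.

```shell
# Sketch: read-only checks, nothing here modifies the vault or its permissions.

# Which authorization model and firewall defaults does the vault use?
vault_access_model() {
  local vault="$1"
  printf 'rbac_authorization=%s\n' \
    "$(az keyvault show --name "$vault" --query 'properties.enableRbacAuthorization' --output tsv)"
  printf 'public_network_access=%s\n' \
    "$(az keyvault show --name "$vault" --query 'properties.publicNetworkAccess' --output tsv)"
  printf 'firewall_default_action=%s\n' \
    "$(az keyvault show --name "$vault" --query 'properties.networkAcls.defaultAction' --output tsv)"
}

# Which roles does the principal hold at this scope, inherited ones included?
roles_for_principal() {
  az role assignment list --assignee "$1" --scope "$2" --include-inherited \
    --query '[].roleDefinitionName' --output tsv
}

# Usage (values are placeholders):
#   vault_access_model kv-prod-app
#   roles_for_principal "$PRINCIPAL_ID" "$VAULT_RESOURCE_ID"
```

If `rbac_authorization` is not `true`, the vault relies on access policies and a role assignment alone would not grant the secret read.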

Verify the private path from the workload

The network test must start from the same network as the application call. If Key Vault or Storage is protected by Private Endpoint, DNS resolution should point to the expected private address. Public resolution can produce a network denial that looks like an identity failure.

bash 02-private-dns-check.sh
HOST=kv-prod-app.vault.azure.net

nslookup "$HOST"

resolved_ip="$(dig +short "$HOST" | tail -n 1)"
case "$resolved_ip" in
10.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|192.168.*)
  echo "private_resolution_ok=$resolved_ip"
  ;;
*)
  echo "unexpected_public_or_empty_resolution=$resolved_ip"
  exit 2
  ;;
esac

If this check fails, the likely fix is in the private DNS zone, its VNet link, hybrid DNS forwarding or the place the runner executes from. Adding an RBAC role will not change the path.
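A failed resolution check does not say which DNS fix applies. One way to narrow it down, sketched below, is to look at the whole `dig +short` chain: a `privatelink` CNAME followed by a public address usually means the Private Endpoint exists but the private zone is not resolved from this network, while no `privatelink` name at all suggests the endpoint is missing or the wrong host is being tested. The function names and result labels are illustrative, and the private ranges are limited to RFC 1918.

```shell
# Sketch: classify a `dig +short` answer chain read from standard input.

# Return 0 when the address is in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

classify_resolution() {
  local chain ip
  chain="$(cat)"
  # The last IPv4 line of the chain is the address the client will connect to.
  ip="$(printf '%s\n' "$chain" | grep -E '^[0-9]+(\.[0-9]+){3}$' | tail -n 1 || true)"
  if [ -z "$ip" ]; then
    echo "no_address"
  elif is_private_ipv4 "$ip"; then
    echo "private"
  elif printf '%s\n' "$chain" | grep -q 'privatelink\.'; then
    echo "privatelink_cname_but_public_address"
  else
    echo "public_no_privatelink"
  fi
}

# Usage from the workload network:
#   dig +short kv-prod-app.vault.azure.net | classify_resolution
```

The explicit `172.1[6-9]`, `172.2[0-9]` and `172.3[01]` patterns keep public ranges such as `172.2.x.x` or `172.200.x.x` from being reported as private.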

Read logs as evidence, not noise

Logs should distinguish identity denial from network denial. For Key Vault, diagnostic logs can expose the operation, result, caller address and identity when the platform captures them. The point is not to depend on a single universal schema, but to keep a triage query ready during an incident.

kusto 03-keyvault-managed-identity-denied.kql
let Window = 6h;
AzureDiagnostics
| where TimeGenerated > ago(Window)
| where ResourceProvider == "MICROSOFT.KEYVAULT"
| where ResultType !in ("Success", "Succeeded")
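| project TimeGenerated,
        Resource,
        OperationName,
        ResultType,
        ResultSignature,
        CallerIPAddress,
        // Identity claim columns vary by workspace; column_ifexists keeps the query valid when one is absent.
        CallerObjectId = tostring(column_ifexists("identity_claim_oid_g", "")),
        CallerAppId = tostring(column_ifexists("identity_claim_appid_g", "")),
        CorrelationId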
| order by TimeGenerated desc

The expected reading is simple: if the identity is missing or unexpected, go back to execution context. If the source address is public or unknown, go back to the private path. If the identity is correct and the network is coherent, then RBAC or policy becomes the main topic.
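That reading order can be written down as a small decision function so that every responder applies it the same way. This is a sketch: the function name, argument order and result labels are assumptions, and the inputs are expected to come from the log query and from the identity check performed on the workload.

```shell
# Sketch: map one denied request to the layer that should be investigated first.
# Arguments: expected principal ID, identity seen in the log, caller IP seen in the log.
triage_denial() {
  local expected="$1" observed="$2" caller_ip="$3"
  # Missing or unexpected identity: go back to the execution context.
  if [ -z "$observed" ] || [ "$observed" != "$expected" ]; then
    echo "execution_context"
    return
  fi
  case "$caller_ip" in
    # Correct identity on a private source address: RBAC or policy is the main topic.
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo "rbac_or_policy" ;;
    # Public or unknown source address: go back to the private path.
    *) echo "private_path" ;;
  esac
}

# Usage (values are placeholders):
#   triage_denial "$EXPECTED_PRINCIPAL_ID" "$LOGGED_OBJECT_ID" "$LOGGED_CALLER_IP"
```

The identity test comes first on purpose: a role question is only worth asking once the principal and the network path have both been confirmed.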

Build a bounded fix

An emergency fix should remain reversible. Adding Contributor, opening public access or temporarily disabling a firewall may restore service, but those actions destroy evidence and often leave undocumented debt behind. The runbook should prefer targeted actions.

text managed-identity-bounded-fix.txt
Targeted fix
Assign the minimal role to the real identity
Fix the private DNS zone link instead of bypassing Private Endpoint
Test from the affected workload or runner
Keep the previous path active only when rollback requires it
Remove the temporary exception with owner and deadline
Document logs, identity, role, source network and validation

Rollback also needs precision. If a previous identity remains necessary, it should be tracked with a removal date. If temporary network access is opened, it should be closed after application validation.
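As an illustration of a bounded and reversible change, the sketch below prints the rollback command before it grants anything, and only applies the grant when `APPLY=1` is set. The function name and the `APPLY` switch are assumptions. The example role, `Key Vault Secrets User`, only applies when the vault uses RBAC authorization rather than access policies.

```shell
# Sketch: grant one role at one scope and record the matching rollback first.
bounded_role_fix() {
  local principal_id="$1" role="$2" scope="$3"
  # Print the rollback before acting so it lands in the incident record.
  printf 'rollback: az role assignment delete --assignee %s --role "%s" --scope %s\n' \
    "$principal_id" "$role" "$scope"
  if [ "${APPLY:-0}" = "1" ]; then
    az role assignment create --assignee-object-id "$principal_id" \
      --assignee-principal-type ServicePrincipal --role "$role" --scope "$scope"
  else
    # Default is a dry run that only shows what would be executed.
    printf 'dry_run: az role assignment create --assignee-object-id %s --assignee-principal-type ServicePrincipal --role "%s" --scope %s\n' \
      "$principal_id" "$role" "$scope"
  fi
}

# Usage (values are placeholders):
#   bounded_role_fix "$PRINCIPAL_ID" "Key Vault Secrets User" "$VAULT_RESOURCE_ID"
#   APPLY=1 bounded_role_fix "$PRINCIPAL_ID" "Key Vault Secrets User" "$VAULT_RESOURCE_ID"
```

Scoping the assignment to the vault resource ID rather than to the resource group or subscription keeps the exception as narrow as the incident.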

Integrate the check into automation

For a platform operated by pipelines or AWX jobs, the diagnosis should be runnable without improvisation. A controlled job can collect the assigned identity, test DNS resolution, call a health endpoint and show the latest Key Vault or Storage denials. The important constraint is to never expose secret values in logs.

text automation-output-contract.txt
Expected diagnostic job output
Workload or runner tested
PrincipalId used
Resolved name and returned address
Target resource and operation tested
Latest denials found in logs
Recommended fix: identity, DNS, network or RBAC
No secret value displayed

This output turns a vague incident into an actionable decision. The team knows whether to correct an identity assignment, a role, a private DNS zone or application configuration.
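A minimal sketch of a job step that emits this contract as key=value lines follows. The field names and argument order are assumptions, and the `redact` filter is only a safety net that masks JWT-shaped strings; it does not replace keeping secret values out of the job in the first place.

```shell
# Sketch: print the diagnostic report and mask anything shaped like a JWT.

# Replace three dot-separated base64url segments starting with "eyJ" by a marker.
redact() {
  sed -E 's/eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]*/[REDACTED_TOKEN]/g'
}

# Arguments: workload, principal ID, resolved name, resolved address,
# target and operation, latest denial summary, recommended fix.
emit_report() {
  {
    printf 'workload=%s\n' "$1"
    printf 'principal_id=%s\n' "$2"
    printf 'resolved=%s -> %s\n' "$3" "$4"
    printf 'target_operation=%s\n' "$5"
    printf 'latest_denials=%s\n' "$6"
    printf 'recommended_fix=%s\n' "$7"
  } | redact
}

# Usage (values are placeholders):
#   emit_report app-private-api "$PRINCIPAL_ID" kv-prod-app.vault.azure.net \
#     "$RESOLVED_IP" "kv-prod-app SecretGet" "$DENIAL_SUMMARY" dns
```

Keeping `recommended_fix` to one of `identity`, `dns`, `network` or `rbac` makes the output usable as a routing decision and not only as a log.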

Conclusion

Diagnosing a managed identity failure is not about adding permissions until the call works. The right runbook first proves the real identity, network path, DNS resolution and denial type observed in logs.

That discipline protects availability and security at the same time: the fix targets the right layer, while exceptions remain minimal, traceable and reversible. Managed identity then becomes an operable mechanism, not a black box that gets widened at every incident.