Azure internal APIM: diagnose a private API before changing policies

Qualify a failure across Application Gateway, WAF, internal APIM and a private backend by separating DNS, routing, policy, identity and logs before any fix.

08 Jun 2026
Tags: azure, api-management, application-gateway, waf, private-endpoint, dns, kql, logs, monitoring, identity, runbook, rollback

When an API published through Azure API Management in internal mode starts returning 404, 500, 502 or 403, it is tempting to change the APIM policy, disable a WAF rule or temporarily reopen the backend. On a private path, that reaction mixes too many layers too quickly: client-side DNS resolution, Application Gateway listener, WAF, route to APIM, inbound policy, backend resolution, identity presented to the target service and application logs.

The use case is an internal API flow: an enterprise client calls Application Gateway with WAF, the gateway forwards to internal APIM, APIM applies policies and then calls an Azure Function, a private API or a service behind Private Endpoint. The runbook goal is to find the layer that rejects or breaks the request without weakening the path permanently.

Map the real flow before touching policies

The first step is to write the expected flow as it really behaves, not the simplified diagram. Each DNS name, TLS hostname and identity should be explicit. An APIM policy can be correct while the backend still resolves publicly from the APIM network. A WAF rule may look responsible while the request never reaches APIM. A backend 403 can come from the APIM identity or from a missing application key.

text private-api-flow.txt
Internal client
  Resolves api.internal.example.com
  Calls Application Gateway with the expected internal hostname

Application Gateway + WAF
  Selects the HTTPS listener
  Applies the WAF policy
  Forwards to the APIM pool with the right host header

Internal APIM
  Receives the request on its private endpoint
  Applies inbound/backend/outbound/on-error policies
  Resolves the private backend by its real name
  Presents the expected identity, certificate, key or token

Private backend
  Receives the request through Private Endpoint or a private path
  Logs correlationId, identity and application result

This map gives a simple rule: do not change an APIM policy until APIM ingress is proven, do not change WAF until a WAF block is visible, and do not open the backend until private resolution from APIM is qualified.

Separate symptoms by layer

HTTP status codes are not enough. A 502 can come from Application Gateway, APIM or the backend. A 403 can be a WAF block, APIM authorization, Key Vault denial, missing Function key or managed identity without the right role. The runbook should force a layered reading.

text layered-reading.txt
Symptom
  First checks

WAF Blocked
  Read ruleId, matchVariable, URI and client IP before any exclusion

Application Gateway 502
  Check backend health, probe, SNI, host header and certificate toward APIM

APIM 404 or 500
  Check API path, operationId, policy error, backend URL and correlationId

Backend 401 or 403
  Check identity, secret, certificate, Entra ID token, RBAC and firewall

Backend unreachable from APIM
  Check private DNS, Private Endpoint, route, NSG and resolver used

The important point is to keep the same time window and the same correlation identifier when possible. Otherwise, the team compares events that do not belong to the same request.

Verify DNS and TLS from the APIM path

APIM should call the backend with its hostname, not with the private IP address. The IP address may help during a one-off network test, but it breaks the TLS model, makes SNI ambiguous and hides private DNS errors. Validation should start from a point that uses the same resolver and path as the APIM instance.

bash 01-apim-backend-dns-tls-check.sh
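# Run from a host that uses the same resolver and network path as the APIM instance.
BACKEND_HOST=func-orders-prod.azurewebsites.net

# Show the full answer, including any CNAME chain.
nslookup "$BACKEND_HOST"

# Keep the last line of the short answer, which is normally the final A record.
resolved_ip="$(dig +short "$BACKEND_HOST" | tail -n 1)"

# RFC 1918 ranges only: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
# A looser pattern such as 172.2* would also accept public ranges like 172.200.x.x.
case "$resolved_ip" in
  10.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|192.168.*)
    echo "private_resolution_ok=$resolved_ip"
    ;;
  *)
    echo "unexpected_public_or_empty_resolution=$resolved_ip"
    exit 2
    ;;
esac

# Show the certificate presented for this hostname, with SNI set explicitly.
openssl s_client -connect "$BACKEND_HOST:443" -servername "$BACKEND_HOST" </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer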

If this check fails, the expected fix is in the private DNS zone, VNet link, hybrid resolver, route or diagnostic runner location. Changing an APIM policy or WAF rule will not fix that layer.
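
When the backend sits behind a Private Endpoint, the public name normally resolves through a privatelink alias before it reaches a private A record, so the whole chain is worth reading, not only the last address. The sketch below assumes a hypothetical helper, summarize_resolution, that reads dig +short output on standard input. The function name and the output labels are illustrative, not part of any Azure tooling.

```shell
# Hypothetical helper: summarize `dig +short <host>` output read from stdin.
# Prints privatelink_cname=yes|no and final_answer=private|public|no_address|empty.
summarize_resolution() {
  awk '
    # A privatelink alias anywhere in the chain suggests the Private Endpoint zone is in play.
    /privatelink\./ { seen = "yes" }
    # Remember the last non-empty line: normally the final A record.
    NF { last = $1 }
    END {
      if (seen == "") seen = "no"
      kind = "public"
      if (last == "") kind = "empty"
      # dig +short prints CNAME targets with a trailing dot; ending on one means no address came back.
      else if (last ~ /\.$/) kind = "no_address"
      else if (last ~ /^10\./) kind = "private"
      else if (last ~ /^192\.168\./) kind = "private"
      else if (last ~ /^172\.(1[6-9]|2[0-9]|3[01])\./) kind = "private"
      print "privatelink_cname=" seen
      print "final_answer=" kind
    }
  '
}

# Usage, from a host that shares the resolver and network path of APIM:
#   dig +short func-orders-prod.azurewebsites.net | summarize_resolution
```

A privatelink alias combined with a public final answer usually points at a missing private DNS zone link or the wrong resolver rather than at the backend itself.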

Read APIM and Application Gateway together

For a private API failure, logs should show whether the request is blocked before APIM, rejected by APIM or denied by the backend. A triage KQL query keeps the reading usable during an incident.

kusto 02-apim-private-api-triage.kql
let Window = 2h;
let Hostname = "api.internal.example.com";
let ApiPath = "/orders";
let Gateway =
AzureDiagnostics
| where TimeGenerated > ago(Window)
| where ResourceProvider == "MICROSOFT.NETWORK"
| where Category in ("ApplicationGatewayAccessLog", "ApplicationGatewayFirewallLog")
| where tostring(host_s) == Hostname or tostring(requestUri_s) has ApiPath
| project TimeGenerated,
        Layer="application-gateway",
        Action=tostring(action_s),
        Status=tostring(httpStatus_d),
        RuleId=tostring(ruleId_s),
        Uri=tostring(requestUri_s),
        ClientIp=tostring(clientIP_s),
        CorrelationId=tostring(transactionId_g);
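// Confirm the APIM column names in your workspace: they differ between
// AzureDiagnostics and resource-specific tables.
let Apim =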
AzureDiagnostics
| where TimeGenerated > ago(Window)
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where tostring(Url) has ApiPath or tostring(RequestUri) has ApiPath
| project TimeGenerated,
        Layer="apim",
        Action=tostring(OperationName),
        Status=tostring(ResponseCode),
        RuleId="",
        Uri=tostring(Url),
        ClientIp=tostring(CallerIPAddress),
        CorrelationId=tostring(CorrelationId);
Gateway
| union Apim
| order by TimeGenerated desc

The query is not meant to replace application logs. It answers a routing question quickly: is the request blocked by WAF, received by APIM, rejected by APIM or missing from the expected path?
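
The routing question can be turned into a single verdict once the query results are narrowed to one request. The sketch below assumes the rows were exported as tab-separated Layer, Action and Status columns, in that order, and that the WAF action value for a block is "Blocked". The helper name triage_verdict and the verdict labels are hypothetical.

```shell
# Hypothetical helper: read tab-separated rows (Layer, Action, Status) from stdin
# and print one verdict for the request they describe.
triage_verdict() {
  awk -F '\t' '
    # A WAF block is recorded on the Application Gateway layer.
    $1 == "application-gateway" && $2 == "Blocked" { waf = 1 }
    # Any APIM row proves ingress; a status of 400 or more marks an error at or behind APIM.
    $1 == "apim" { apim = 1; if ($3 + 0 >= 400) apim_error = 1 }
    END {
      if (waf) print "blocked_by_waf"
      else if (!apim) print "missing_from_apim"
      else if (apim_error) print "error_at_or_behind_apim"
      else print "received_by_apim"
    }
  '
}

# Usage, with a hypothetical export file:
#   triage_verdict < triage-rows.tsv
```

The verdict is only as good as the filter: rows from several requests in the same window will blur it, which is why the time window and correlation identifier matter.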

Replay a controlled request

After the likely layer is identified, replay a minimal request with an explicit x-correlation-id. The test should use the same hostname as clients, not a bypass endpoint.

bash 03-controlled-request.sh
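# Unique, searchable identifier for this replay.
CORRELATION_ID="ops-$(date +%Y%m%d%H%M%S)"

# Certificate validation stays on (no -k) so a TLS problem is not hidden;
# add --cacert if the listener certificate is issued by a private CA.
# The Host header is derived from the URL, so it is not set separately.
curl -v "https://api.internal.example.com/orders/health" -H "x-correlation-id: $CORRELATION_ID"

echo "correlation_id=$CORRELATION_ID"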

If the request appears in Application Gateway but not in APIM, inspect the backend pool, host header, probe and TLS toward APIM. If it appears in APIM but not in the backend, inspect backend policy, private DNS, route and identity. If it appears everywhere with an application denial, the fix is probably not network-related.
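
The decision above can be written down as a small lookup so that the next inspection is not chosen by intuition during an incident. This is a minimal sketch with a hypothetical helper, next_inspection, that takes three yes/no answers. The first branch, for a request that never reaches Application Gateway, is an extrapolation from the flow map rather than a rule stated in the paragraph above.

```shell
# Hypothetical helper: map where the replayed request was observed to the next inspection.
# Arguments are "yes" or "no": seen in Application Gateway, seen in APIM, seen in backend logs.
next_inspection() {
  gateway_seen="$1"
  apim_seen="$2"
  backend_seen="$3"
  if [ "$gateway_seen" != "yes" ]; then
    # Assumption: nothing in the gateway logs means the client never reached it.
    echo "client DNS, listener and network path to Application Gateway"
  elif [ "$apim_seen" != "yes" ]; then
    echo "backend pool, host header, probe and TLS toward APIM"
  elif [ "$backend_seen" != "yes" ]; then
    echo "backend policy, private DNS, route and identity"
  else
    echo "application-level denial, probably not network-related"
  fi
}

# Usage:
#   next_inspection yes yes no
```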

Bound the fix and rollback

The fix should remain minimal. A WAF exclusion should target the variable and rule involved. An APIM policy should be versioned and tested with a correlation identifier. A temporary network opening needs an owner, a duration and proof of removal. An identity change must be validated with the real principal used by APIM or the backend.

text bounded-fix-checklist.txt
Before change
  Faulty layer identified with logs
  Reproducible test request
  Security impact understood
  Rollback documented

During change
  Change one layer only
  Keep the correlation identifier
  Watch WAF, APIM and backend in the same window

After change
  Replay the controlled request
  Check for unexpected public exposure
  Remove the temporary exception
  Document evidence, owner and removal date

This discipline prevents repairs that only work because several controls were bypassed at once.
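
A temporary exception is easier to remove when its owner and removal date are checked mechanically. The sketch below assumes a hypothetical helper, exception_status, that receives the owner, the removal date and the current date as YYYY-MM-DD strings. How these values are stored (ticket, tag, file) is left open and is not prescribed here.

```shell
# Hypothetical helper: report whether a temporary exception is overdue for removal.
# Dates are YYYY-MM-DD; "today" is passed in so the check is repeatable.
exception_status() {
  owner="$1"
  remove_by="$2"
  today="$3"
  if [ -z "$owner" ] || [ -z "$remove_by" ]; then
    echo "invalid: owner and removal date are required"
    return 1
  fi
  # Strip dashes so the dates compare as plain integers (20260608 vs 20260610).
  remove_num="$(printf '%s' "$remove_by" | tr -d '-')"
  today_num="$(printf '%s' "$today" | tr -d '-')"
  if [ "$today_num" -gt "$remove_num" ]; then
    echo "overdue: $owner must remove the exception (due $remove_by)"
  else
    echo "active: owned by $owner until $remove_by"
  fi
}

# Usage:
#   exception_status "network-team" 2026-06-10 "$(date +%Y-%m-%d)"
```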

Conclusion

An internal APIM incident is not only a policy topic. It is a complete path combining DNS, TLS, WAF, Application Gateway, APIM, identity and a private backend. The right runbook starts by proving where the request disappears or changes status.

By keeping layers separate, the team can fix quickly without turning urgency into a lasting exception: no global WAF exclusion, no backend reopened by reflex, no policy changed without evidence. The private API remains operable because every decision is backed by a verifiable signal.