Azure internal APIM: diagnose a private API before changing policies
Qualify a failure across Application Gateway, WAF, internal APIM and a private backend by separating DNS, routing, policy, identity and logs before any fix.
When an API published through Azure API Management in internal mode starts returning 404, 500, 502 or 403, it is tempting to change the APIM policy, disable a WAF rule or temporarily reopen the backend. On a private path, that reaction mixes too many layers too quickly: client-side DNS resolution, Application Gateway listener, WAF, route to APIM, inbound policy, backend resolution, identity presented to the target service and application logs.
The use case is an internal API flow: an enterprise client calls Application Gateway with WAF, the gateway forwards to internal APIM, APIM applies policies and then calls an Azure Function, a private API or a service behind Private Endpoint. The runbook goal is to find the layer that rejects or breaks the request without weakening the path permanently.
Map the real flow before touching policies
The first step is to write the expected flow as it really behaves, not the simplified diagram. Each DNS name, TLS hostname and identity should be explicit. An APIM policy can be correct while the backend still resolves publicly from the APIM network. A WAF rule may look responsible while the request never reaches APIM. A backend 403 can come from the APIM identity or from a missing application key.
Internal client
  Resolves api.internal.example.com
  Calls Application Gateway with the expected internal hostname
Application Gateway + WAF
  Selects the HTTPS listener
  Applies the WAF policy
  Forwards to the APIM pool with the right host header
Internal APIM
  Receives the request on its private endpoint
  Applies inbound/backend/outbound/on-error policies
  Resolves the private backend by its real name
  Presents the expected identity, certificate, key or token
Private backend
  Receives the request through Private Endpoint or a private path
  Logs correlationId, identity and application result
This map gives a simple rule: do not change an APIM policy until APIM ingress is proven, do not change WAF until a WAF block is visible, and do not open the backend until private resolution from APIM is qualified.
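The gating rule can be written down as a small evidence check so that nobody skips it under pressure. The sketch below is only an illustration: the function name allowed_changes and the labels it prints are hypothetical, and each "yes" is assumed to come from an observed log line or test, not from a guess.

```shell
#!/usr/bin/env bash
# Hypothetical evidence gate. Each argument is "yes" or "no":
#   $1 = request seen in APIM logs (APIM ingress proven)
#   $2 = WAF block visible in the firewall log
#   $3 = backend resolves privately from the APIM path
allowed_changes() {
  local apim_ingress="$1" waf_block="$2" private_dns="$3"
  local allowed=""
  if [ "$apim_ingress" = "yes" ]; then allowed="$allowed apim-policy"; fi
  if [ "$waf_block" = "yes" ]; then allowed="$allowed waf-rule"; fi
  if [ "$private_dns" = "yes" ]; then allowed="$allowed backend-opening"; fi
  # Drop the leading space; print "none" when nothing is proven yet.
  allowed="${allowed# }"
  echo "${allowed:-none}"
}

# Only APIM ingress is proven, so only the APIM policy may be considered.
allowed_changes yes no no
```

Printing "none" is deliberate: with no evidence, the only permitted action is to keep collecting it.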
Separate symptoms by layer
HTTP status codes are not enough. A 502 can come from Application Gateway, APIM or the backend. A 403 can be a WAF block, APIM authorization, Key Vault denial, missing Function key or managed identity without the right role. The runbook should force a layered reading.
Symptom | First checks
WAF Blocked | Read ruleId, matchVariable, URI and client IP before any exclusion
Application Gateway 502 | Check backend health, probe, SNI, host header and certificate toward APIM
APIM 404 or 500 | Check API path, operationId, policy error, backend URL and correlationId
Backend 401 or 403 | Check identity, secret, certificate, Entra ID token, RBAC and firewall
Backend unreachable from APIM | Check private DNS, Private Endpoint, route, NSG and resolver used
The important point is to keep the same time window and the same correlation identifier when possible. Otherwise, the team compares events that do not belong to the same request.
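The same mapping can be kept next to the runbook as a lookup, so that the layer is named before the status code is interpreted. This is a minimal sketch: the function first_checks and its layer keywords (waf, gateway, apim, backend) are hypothetical names chosen for this example, and the check lists simply restate the symptoms above.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: first_checks <layer> <status-or-symptom>
first_checks() {
  local layer="$1" symptom="$2"
  case "$layer:$symptom" in
    waf:blocked)
      echo "ruleId matchVariable URI clientIP" ;;
    gateway:502)
      echo "backend-health probe SNI host-header certificate" ;;
    apim:404|apim:500)
      echo "api-path operationId policy-error backend-URL correlationId" ;;
    backend:401|backend:403)
      echo "identity secret certificate entra-token RBAC firewall" ;;
    backend:unreachable)
      echo "private-DNS private-endpoint route NSG resolver" ;;
    *)
      # A bare status code without a layer is not enough to pick checks.
      echo "unmapped: identify the emitting layer first" ;;
  esac
}

first_checks gateway 502
```

Requiring the layer as the first argument is the point of the design: a 403 alone returns "unmapped" until someone states which component emitted it.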
Verify DNS and TLS from the APIM path
APIM should call the backend with its hostname, not with the private IP address. The IP address may help during a one-off network test, but it breaks the TLS model, makes SNI ambiguous and hides private DNS errors. Validation should start from a point that uses the same resolver and path as the APIM instance.
BACKEND_HOST=func-orders-prod.azurewebsites.net
# Show the full answer, including which resolver replied.
nslookup "$BACKEND_HOST"
# Keep the last line of the short answer: the A record after any CNAME chain.
resolved_ip="$(dig +short "$BACKEND_HOST" | tail -n 1)"
# RFC 1918 ranges: 10.0.0.0/8, 172.16.0.0/12 (172.16 to 172.31) and 192.168.0.0/16.
case "$resolved_ip" in
  10.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|192.168.*)
    echo "private_resolution_ok=$resolved_ip"
    ;;
  *)
    echo "unexpected_public_or_empty_resolution=$resolved_ip"
    exit 2
    ;;
esac
# Send the hostname as SNI and print the subject and issuer of the certificate served.
openssl s_client -connect "$BACKEND_HOST:443" -servername "$BACKEND_HOST" </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer
If this check fails, the expected fix is in the private DNS zone, VNet link, hybrid resolver, route or diagnostic runner location. Changing an APIM policy or WAF rule will not fix that layer.
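Prefix globs are easy to get subtly wrong for the 172.16.0.0/12 range: a pattern such as 172.2* would also accept 172.200.x.x, which is public. Where the private-range test is reused across scripts, a stricter octet-based check may be worth the extra lines. The function name is_rfc1918 is an assumption of this sketch, and it only covers IPv4.

```shell
#!/usr/bin/env bash
# Returns 0 only for a well-formed IPv4 address inside an RFC 1918 range.
is_rfc1918() {
  local ip="$1" octet
  # Reject empty input and anything other than digits and dots.
  case "$ip" in
    ''|*[!0-9.]*) return 1 ;;
  esac
  # Split on dots into positional parameters (local to this function).
  local IFS=.
  set -- $ip
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ -n "$octet" ] && [ "$octet" -le 255 ] 2>/dev/null || return 1
  done
  # Compare the first two octets exactly instead of matching string prefixes.
  case "$1.$2" in
    10.*|172.1[6-9]|172.2[0-9]|172.3[01]|192.168) return 0 ;;
  esac
  return 1
}

if is_rfc1918 "10.20.0.4"; then echo "private"; else echo "not private"; fi
```

An empty answer from the resolver fails the check as well, which keeps "no resolution" and "public resolution" on the same error path as in the script above.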
Read APIM and Application Gateway together
For a private API failure, logs should show whether the request is blocked before APIM, rejected by APIM or denied by the backend. A triage KQL query keeps the reading usable during an incident.
let Window = 2h;
let Hostname = "api.internal.example.com";
let ApiPath = "/orders";
let Gateway =
    AzureDiagnostics
    | where TimeGenerated > ago(Window)
    | where ResourceProvider == "MICROSOFT.NETWORK"
    | where Category in ("ApplicationGatewayAccessLog", "ApplicationGatewayFirewallLog")
    | where tostring(host_s) == Hostname or tostring(requestUri_s) has ApiPath
    | project TimeGenerated,
        Layer="application-gateway",
        Action=tostring(action_s),
        Status=tostring(httpStatus_d),
        RuleId=tostring(ruleId_s),
        Uri=tostring(requestUri_s),
        ClientIp=tostring(clientIP_s),
        CorrelationId=tostring(transactionId_g);
let Apim =
    AzureDiagnostics
    | where TimeGenerated > ago(Window)
    | where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
    | where tostring(Url) has ApiPath or tostring(RequestUri) has ApiPath
    | project TimeGenerated,
        Layer="apim",
        Action=tostring(OperationName),
        Status=tostring(ResponseCode),
        RuleId="",
        Uri=tostring(Url),
        ClientIp=tostring(CallerIPAddress),
        CorrelationId=tostring(CorrelationId);
Gateway
| union Apim
| order by TimeGenerated desc
The query is not meant to replace application logs. It answers a routing question quickly: is the request blocked by WAF, received by APIM, rejected by APIM or missing from the expected path?
Replay a controlled request
After the likely layer is identified, replay a minimal request with an explicit x-correlation-id. The test should use the same hostname as clients, not a bypass endpoint.
CORRELATION_ID="ops-$(date +%Y%m%d%H%M%S)"
# -k skips certificate verification; drop it when the runner trusts the issuing CA so that TLS errors stay visible.
curl -vk "https://api.internal.example.com/orders/health" -H "x-correlation-id: $CORRELATION_ID" -H "Host: api.internal.example.com"
echo "correlation_id=$CORRELATION_ID"
If the request appears in Application Gateway but not in APIM, inspect the backend pool, host header, probe and TLS toward APIM. If it appears in APIM but not in the backend, inspect backend policy, private DNS, route and identity. If it appears everywhere with an application denial, the fix is probably not network-related.
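The reading of the replayed request can be sketched as a decision function over three observations: whether the correlation identifier was found in Application Gateway logs, in APIM logs and in backend logs. The name locate_break and the output labels are hypothetical. The first branch (request absent from the gateway) is an assumption that extends the cases above, based on the client-side steps of the flow.

```shell
#!/usr/bin/env bash
# locate_break <seen_in_gateway> <seen_in_apim> <seen_in_backend>
# Each argument is "yes" or "no" for the same correlation identifier.
locate_break() {
  local gateway="$1" apim="$2" backend="$3"
  if [ "$gateway" != "yes" ]; then
    # Assumption: nothing logged at the gateway points to the client side.
    echo "before-gateway: check client DNS, listener and WAF log"
  elif [ "$apim" != "yes" ]; then
    echo "gateway-to-apim: check backend pool, host header, probe and TLS toward APIM"
  elif [ "$backend" != "yes" ]; then
    echo "apim-to-backend: check backend policy, private DNS, route and identity"
  else
    echo "application: request reached every layer, read the application denial"
  fi
}

# Seen at the gateway and in APIM, but never logged by the backend.
locate_break yes yes no
```

The branches are ordered along the request path, so the first missing layer wins even when later observations are unreliable.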
Bound the fix and rollback
The fix should remain minimal. A WAF exclusion should target the variable and rule involved. An APIM policy should be versioned and tested with a correlation identifier. A temporary network opening needs an owner, a duration and proof of removal. An identity change must be validated with the real principal used by APIM or the backend.
Before change
  Faulty layer identified with logs
  Reproducible test request
  Security impact understood
  Rollback documented
During change
  Change one layer only
  Keep the correlation identifier
  Watch WAF, APIM and backend in the same window
After change
  Replay the controlled request
  Check for unexpected public exposure
  Remove the temporary exception
  Document evidence, owner and removal date
This discipline prevents repairs that only work because several controls were bypassed at once.
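The owner and removal date of a temporary exception are easier to enforce when they are checked by a script and not only written in a ticket. The sketch below assumes a hypothetical record made of an identifier, an owner and an expiry expressed as a Unix timestamp; the function exception_status and its return codes are illustrative, not part of any Azure tooling.

```shell
#!/usr/bin/env bash
# exception_status <id> <owner> <expires_epoch> [now_epoch]
# Returns 0 if still valid, 1 if expired, 2 if the record has no owner.
exception_status() {
  local id="$1" owner="$2" expires="$3" now="${4:-$(date +%s)}"
  if [ -z "$owner" ]; then
    echo "$id: invalid, no owner"
    return 2
  fi
  if [ "$now" -ge "$expires" ]; then
    echo "$id: expired, remove now (owner: $owner)"
    return 1
  fi
  # Remaining time is rounded down to whole hours.
  echo "$id: active, $(( (expires - now) / 3600 ))h left (owner: $owner)"
}

# Hypothetical exception expiring 7200 seconds after the reference time.
exception_status tmp-001 alice 10000 2800
```

Passing the current time as an optional fourth argument keeps the function testable; in normal use it falls back to date +%s.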
Conclusion
An internal APIM incident is not only a policy topic. It is a complete path combining DNS, TLS, WAF, Application Gateway, APIM, identity and a private backend. The right runbook starts by proving where the request disappears or changes status.
By keeping layers separate, the team can fix quickly without turning urgency into a lasting exception: no global WAF exclusion, no backend reopened by reflex, no policy changed without evidence. The private API remains operable because every decision is backed by a verifiable signal.