Azure AKS: diagnose private ingress before changing deployments
Build an operational runbook for AKS private ingress failures by separating DNS, Application Gateway, ingress controller, Kubernetes service endpoints, pod readiness and rollback evidence.
A private AKS application can fail in several places before the request reaches the container. The hostname may still resolve publicly, Application Gateway may probe the wrong path, the ingress controller may not receive the route, the Kubernetes service may have no endpoints, or the pods may be unready after a deployment. From the outside, many of these situations look like the same 502, timeout or empty response.
The use case is an internal web API hosted on AKS. Internal clients call api.internal.example.com, traffic enters through Application Gateway or another private edge, then reaches an ingress controller in the cluster. A release has just changed an ingress rule, service selector, readiness probe or deployment image. Before scaling pods, changing WAF rules or rolling back blindly, the runbook must prove where the request stops.
Draw the private AKS path before touching Kubernetes
AKS adds a cluster layer between the private network and the workload. That layer is useful only if it is observable: DNS, gateway, ingress controller, service, endpoint slices and pod readiness need to be read as one path.
Internal client: resolves api.internal.example.com; calls the expected private hostname.
Private edge: Application Gateway, private load balancer or internal reverse proxy; preserves the expected host header; runs health probes and optional WAF checks.
AKS ingress controller: receives the ingress rule for the hostname and path; routes traffic to a Kubernetes service.
Kubernetes service: selects pods through labels; marks endpoints ready in its endpoint slices only when pods pass readiness.
Pods: pass readiness probes; emit application logs; reach private dependencies through DNS, identity and network policy.
This map prevents a common operational mistake: treating an ingress failure as an image problem. If the service has no ready endpoint, the image may be fine while the selector or readiness probe is wrong. If gateway probes fail before the controller sees traffic, changing the pod will not help.
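The path map can be read as an ordered checklist in which the first hop that is not healthy is the place to investigate. The following is a minimal sketch of that idea, not part of any Azure or Kubernetes tooling: `first_failed_hop` is a hypothetical helper, and the hop names and status values are assumptions supplied by the operator after checking each layer by hand.

```shell
# Hypothetical helper: print the first hop whose status is not "ok".
# Arguments are name=status pairs listed in traffic order.
first_failed_hop() {
  for pair in "$@"; do
    hop=${pair%%=*}      # text before the first "="
    status=${pair#*=}    # text after the first "="
    if [ "$status" != "ok" ]; then
      echo "$hop"
      return 0
    fi
  done
  echo "none"
}
```

For example, `first_failed_hop dns=ok gateway=ok ingress=ok service=no_endpoints pods=ok` prints `service`, which points at selectors and readiness rather than at the image.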
Separate gateway, ingress and service symptoms
The first diagnostic step is classification. A private DNS problem, a gateway health problem, a missing ingress rule and an empty Kubernetes service do not have the same fix.
Symptom and first checks:
Hostname resolves to a public or unexpected address: check private DNS zone, VNet links, resolver forwarding and custom domain target.
Application Gateway returns 502: check backend health, host header, TLS/SNI and probe path toward the ingress controller.
Ingress controller logs show no request: check gateway routing, NSG, private load balancer and controller service address.
Ingress controller logs route errors: check ingress class, host, path, TLS secret and backend service name.
Kubernetes service has no endpoints: check selectors, readiness probes, namespace and pod labels.
Pods receive traffic but dependency fails: check managed identity, private DNS, network policy and downstream firewalls.
The useful question is not only “are pods running?”. It is “does the exact private hostname reach the expected ingress rule, service endpoints and ready pods?”.
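The classification above can be kept next to the runbook as a small lookup so that the first layer to inspect is not chosen from memory during an incident. This is only a sketch: the function name `first_layer_for` and the symptom keywords are hypothetical labels invented for this example, and they map one-to-one onto the six symptoms listed above.

```shell
# Hypothetical lookup: symptom keyword -> layer to inspect first.
first_layer_for() {
  case "$1" in
    public_resolution)      echo "private_dns" ;;
    gateway_502)            echo "gateway_backend_health" ;;
    controller_no_request)  echo "edge_routing" ;;
    controller_route_error) echo "ingress_rule" ;;
    no_endpoints)           echo "service_selector_and_readiness" ;;
    dependency_failure)     echo "pod_dependencies" ;;
    *)                      echo "unclassified" ;;
  esac
}
```

An unknown keyword prints `unclassified`, a signal to go back to the symptom list instead of guessing a layer.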
Prove DNS and TLS from the caller network
Start from the same network as the real caller or from a diagnostic runner attached to that network. The goal is to capture the hostname, final address and certificate before entering the cluster.
HOSTNAME=api.internal.example.com
# Resolve from the caller network with both tools and compare the answers.
nslookup "$HOSTNAME"
dig +short "$HOSTNAME"
# Keep the last line of the answer (the final A record after any CNAME chain).
ip=$(dig +short "$HOSTNAME" | tail -n 1)
# RFC 1918 ranges only: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
case "$ip" in
  10.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|192.168.*)
    echo "private_resolution_ok=$ip"
    ;;
  *)
    echo "unexpected_public_or_empty_resolution=$ip"
    exit 2
    ;;
esac
# Show the certificate that the private listener presents for this SNI name.
openssl s_client -connect "$HOSTNAME:443" -servername "$HOSTNAME" </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer
If this fails, stay outside Kubernetes. Fix private DNS, forwarding, zone links, gateway listener configuration or certificate binding first. The cluster cannot repair a hostname that resolves to the wrong place.
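The check above looks only at the last line of the `dig +short` answer. When a name returns several A records, it can help to require that every returned address is private. The sketch below is one way to do that with numeric octet comparison; `all_private` is a hypothetical helper name, and it assumes its input is `dig +short` style output with one CNAME target or IPv4 address per line.

```shell
# Hypothetical helper: succeed only if stdin holds at least one IPv4 address
# and every IPv4 address falls in an RFC 1918 range. Lines that are not
# dotted IPv4 literals (CNAME targets, IPv6 addresses, blanks) are skipped.
all_private() {
  seen=0
  while IFS= read -r line; do
    case "$line" in
      ''|*[!0-9.]*) continue ;;
    esac
    seen=1
    o1=${line%%.*}       # first octet
    rest=${line#*.}
    o2=${rest%%.*}       # second octet
    if [ "$o1" = 10 ]; then continue; fi
    if [ "$o1" = 192 ] && [ "$o2" = 168 ]; then continue; fi
    if [ "$o1" = 172 ] && [ "$o2" -ge 16 ] && [ "$o2" -le 31 ]; then continue; fi
    return 1             # a public address was found
  done
  [ "$seen" -eq 1 ]      # an empty answer also counts as a failure
}
```

A possible use is `dig +short "$HOSTNAME" | all_private || echo "resolution_not_fully_private"`, run from the same network as the real caller.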
Read Kubernetes objects in dependency order
Inside the cluster, avoid jumping directly to pod logs. Read the route objects in the same order as traffic uses them: ingress, service, endpoints, pods and recent events.
NS=prod
INGRESS=orders-api
SERVICE=orders-api
APP_LABEL=orders-api
# 1. Ingress: host, path, ingress class, TLS secret and backend service.
kubectl -n "$NS" describe ingress "$INGRESS"
kubectl -n "$NS" get ingress "$INGRESS" -o wide
# 2. Service: selector and ports.
kubectl -n "$NS" describe service "$SERVICE"
# 3. Endpoint slices: which pod addresses currently back the service.
kubectl -n "$NS" get endpointslice -l kubernetes.io/service-name="$SERVICE" -o wide
# 4. Pods: this assumes the pods carry the label app=$APP_LABEL.
kubectl -n "$NS" get pods -l app="$APP_LABEL" -o wide
kubectl -n "$NS" describe pods -l app="$APP_LABEL" | grep -Ei 'Ready|Readiness|Warning|Failed|Unhealthy|Back-off'
A service with no endpoints usually means labels or readiness are wrong. An ingress pointing to the wrong service means the ingress rule, not the workload, needs the fix. Pods that are ready but return dependency errors move the diagnosis toward identity, DNS or downstream services.
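When the service has no endpoints, the quickest confirmation is to compare the service selector with the labels on a pod. The sketch below does that comparison on plain comma-separated `key=value` strings, the form shown in the SELECTOR column of `kubectl get service -o wide` and in the LABELS column of `kubectl get pods --show-labels`. `selector_matches` is a hypothetical helper, and copying the two strings into it by hand is an assumption of this example.

```shell
# Hypothetical helper: check that every key=value pair of a service selector
# appears among a pod's labels.
# $1 = selector, for example "app=orders-api,tier=web"
# $2 = pod labels, for example "app=orders-api,tier=web,pod-template-hash=5d7f9"
selector_matches() {
  selector=$1
  labels=",$2,"
  if [ -z "$selector" ]; then
    echo "no-selector"
    return 1
  fi
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                      # this pair is present on the pod
      *) IFS=$old_ifs
         echo "missing:$pair"
         return 1 ;;
    esac
  done
  IFS=$old_ifs
  echo "match"
}
```

A result such as `missing:tier=web` shows that the pod lacks a label the selector requires, while `match` on an unready pod moves attention to the readiness probe.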
Correlate ingress controller logs with pod logs
The ingress controller log often contains the best clue: upstream unavailable, backend not found, TLS mismatch, timeout, path mismatch or no endpoints. Read it alongside the application logs over the same incident window.
let Window = 2h;
let Namespace = "prod";
let Host = "api.internal.example.com";
// ContainerLogV2 exposes the namespace as PodNamespace and stores LogMessage as dynamic,
// so the message is converted to a string before text matching.
let ControllerLogs =
ContainerLogV2
| where TimeGenerated > ago(Window)
| where PodNamespace has_any ("ingress", "ingress-nginx", "appgw")
| extend Message = tostring(LogMessage)
| where Message has_any (Host, "upstream", "no endpoints", "502", "timeout", "connect", "service")
| project TimeGenerated, Source="ingress-controller", PodName, PodNamespace, LogMessage=Message;
let AppLogs =
ContainerLogV2
| where TimeGenerated > ago(Window)
| where PodNamespace == Namespace
| extend Message = tostring(LogMessage)
| where Message has_any ("error", "failed", "timeout", "dependency", "ready", "health")
| project TimeGenerated, Source="application", PodName, PodNamespace, LogMessage=Message;
ControllerLogs
| union AppLogs
| order by TimeGenerated desc
The quick read is direct: controller logs without app logs point to routing, service or endpoints; app logs without controller errors point to workload or dependency behavior; no signal in either view sends the diagnosis back to gateway, DNS or the caller network.
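The quick read above is a small decision table, and writing it down removes ambiguity during an incident. The sketch below encodes it as a shell function. `next_step` is a hypothetical name, and its two arguments are assumed to be the row counts an operator reads from the controller and application halves of the query. The case where both views contain rows is not covered by the rule above, so the `compare_timestamps` answer is an assumption of this example.

```shell
# Hypothetical helper: turn the two log counts into the next diagnostic area.
# $1 = matching ingress controller rows, $2 = matching application rows
next_step() {
  controller_hits=$1
  app_hits=$2
  if [ "$controller_hits" -gt 0 ] && [ "$app_hits" -eq 0 ]; then
    echo "routing_service_or_endpoints"
  elif [ "$controller_hits" -eq 0 ] && [ "$app_hits" -gt 0 ]; then
    echo "workload_or_dependency"
  elif [ "$controller_hits" -eq 0 ] && [ "$app_hits" -eq 0 ]; then
    echo "gateway_dns_or_caller_network"
  else
    # Assumption: with signal in both views, line the entries up in time.
    echo "compare_timestamps"
  fi
}
```

For instance, `next_step 14 0` prints `routing_service_or_endpoints`, which sends the operator back to the ingress, service and endpoint slice checks before any pod change.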
Keep rollback bounded
Rollback should target the layer that changed. If the ingress rule changed, revert the ingress. If the service selector changed, restore the selector. If a deployment created unready pods, roll back that deployment. Avoid a broad rollback that hides the broken layer.
NS=prod
DEPLOYMENT=orders-api
HOSTNAME=api.internal.example.com
CORRELATION_ID="ops-$(date +%Y%m%d%H%M%S)"
# Review the revision history first so the rollback target is known.
kubectl -n "$NS" rollout history deployment "$DEPLOYMENT"
# Roll back only this deployment to its previous revision and wait for it.
kubectl -n "$NS" rollout undo deployment "$DEPLOYMENT"
kubectl -n "$NS" rollout status deployment "$DEPLOYMENT" --timeout=180s
# Validate through the private hostname. -k skips certificate verification;
# drop it when the runner trusts the issuing CA.
curl -vk "https://$HOSTNAME/health" -H "x-correlation-id: $CORRELATION_ID"
echo "validate_correlation_id=$CORRELATION_ID"
If the rollback fixes the call, keep the evidence: previous replica set, image reference, ingress/service state, correlation ID and validation time. Without that trail, the next incident will restart from guesswork.
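The evidence trail is easier to keep when it has a fixed shape. The sketch below prints the five items named above as `key=value` lines that can be appended to an incident note. `evidence_record` and its field names are hypothetical, and the values are assumed to be copied by the operator from the `kubectl` and `curl` output of the rollback.

```shell
# Hypothetical helper: print one rollback evidence record as key=value lines.
# $1 = correlation ID, $2 = previous replica set, $3 = image reference,
# $4 = short note on ingress/service state
evidence_record() {
  printf 'validated_at=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'correlation_id=%s\n' "$1"
  printf 'previous_replica_set=%s\n' "$2"
  printf 'image=%s\n' "$3"
  printf 'route_state=%s\n' "$4"
}
```

A call such as `evidence_record "$CORRELATION_ID" "<replica-set>" "<image>" "ingress and service unchanged" >> incident-notes.txt` keeps the record next to the timeline; the file name and placeholder values are illustrative only.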
Conclusion
Private AKS ingress is reliable only when the path is readable from the network edge to the pod. DNS, Application Gateway, ingress controller, Kubernetes service endpoints, readiness and application logs must be separated before any fix is chosen.
The practical rule is simple: do not change deployments until the request has reached the cluster, do not change WAF or gateway rules until the controller path is visible, and do not roll back an image when the service has no endpoints. With that discipline, a private AKS incident becomes a bounded diagnosis instead of a noisy deployment emergency.