
Azure AKS: diagnose private ingress before changing deployments

Build an operational runbook for AKS private ingress failures by separating DNS, Application Gateway, ingress controller, Kubernetes service endpoints, pod readiness and rollback evidence.

10 Jun 2026 | azure, aks, kubernetes, private-endpoint, dns, kql, logs, monitoring, runbook, rollback, automation

A private AKS application can fail in several places before the request reaches the container. The hostname may still resolve publicly, Application Gateway may probe the wrong path, the ingress controller may not receive the route, the Kubernetes service may have no endpoints, or the pods may be unready after a deployment. From the outside, many of these situations look like the same 502, timeout or empty response.

The use case is an internal web API hosted on AKS. Internal clients call api.internal.example.com, traffic enters through Application Gateway or another private edge, then reaches an ingress controller in the cluster. A release has just changed an ingress rule, service selector, readiness probe or deployment image. Before scaling pods, changing WAF rules or rolling back blindly, the runbook must prove where the request stops.

Draw the private AKS path before touching Kubernetes

AKS adds a cluster layer between the private network and the workload. That layer is useful only if it is observable: DNS, gateway, ingress controller, service, endpoint slices and pod readiness need to be read as one path.

text aks-private-ingress-path.txt
Internal client
Resolves api.internal.example.com
Calls the expected private hostname

Private edge
Application Gateway, private load balancer or internal reverse proxy
Preserves the expected host header
Runs health probes and optional WAF checks

AKS ingress controller
Receives the ingress rule for the hostname and path
Routes traffic to a Kubernetes service

Kubernetes service
Selects pods through labels
Marks endpoints in its endpoint slices as ready only when pods pass readiness

Pods
Pass readiness probes
Emit application logs
Reach private dependencies through DNS, identity and network policy

This map prevents a common operational mistake: treating an ingress failure as an image problem. If the service has no ready endpoints, the deployment image may be fine while the selector or readiness probe is wrong. If gateway probes fail before the controller sees traffic, changing the pod will not help.

Separate gateway, ingress and service symptoms

The first diagnostic step is classification. A private DNS problem, a gateway health problem, a missing ingress rule and an empty Kubernetes service do not have the same fix.

text aks-private-ingress-symptoms.txt
Symptom
  What to check
Hostname resolves to a public or unexpected address
  Check private DNS zone, VNet links, resolver forwarding and custom domain target

Application Gateway returns 502
  Check backend health, host header, TLS/SNI and probe path toward the ingress controller

Ingress controller logs show no request
  Check gateway routing, NSG, private load balancer and controller service address

Ingress controller logs route errors
  Check ingress class, host, path, TLS secret and backend service name

Kubernetes service has no endpoints
  Check selectors, readiness probes, namespace and pod labels

Pods receive traffic but dependency fails
  Check managed identity, private DNS, network policy and downstream firewalls

The useful question is not only “are pods running?” but “does the exact private hostname reach the expected ingress rule, service endpoints and ready pods?”

Prove DNS and TLS from the caller network

Start from the same network as the real caller or from a diagnostic runner attached to that network. The goal is to capture the hostname, final address and certificate before entering the cluster.

bash 01-aks-private-dns-tls-check.sh
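# Avoid the name HOSTNAME: bash already sets that variable to the local machine name.
TARGET_HOST=api.internal.example.com

# Show the resolver's full answer, then the short form used below.
nslookup "$TARGET_HOST"
dig +short "$TARGET_HOST"

# The last line of the short answer is an address, after any CNAME records.
ip=$(dig +short "$TARGET_HOST" | tail -n 1)
case "$ip" in
# RFC 1918 only: 10/8, 172.16-172.31 and 192.168/16 (172.2* alone would also match public 172.2xx ranges).
10.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*|192.168.*)
  echo "private_resolution_ok=$ip"
  ;;
*)
  echo "unexpected_public_or_empty_resolution=$ip"
  exit 2
  ;;
esac

# Print the subject and issuer of the certificate served for this SNI name.
openssl s_client -connect "$TARGET_HOST:443" -servername "$TARGET_HOST" </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer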

If this fails, stay outside Kubernetes. Fix private DNS, forwarding, zone links, gateway listener configuration or certificate binding first. The cluster cannot repair a hostname that resolves to the wrong place.
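Prefix matching on the resolved address is easy to get subtly wrong, so a stricter test can compare the octets numerically. The following is a minimal sketch, assuming the private edge lives in RFC 1918 address space; `is_rfc1918` is a hypothetical helper name, not part of any Azure or Kubernetes tooling.

```shell
# Sketch only: returns 0 when the argument is an RFC 1918 IPv4 address.
# is_rfc1918 is an illustrative name; adapt it if your network uses other private ranges.
is_rfc1918() {
  ip=$1
  # Reject empty input, non-numeric characters, and leading or trailing dots.
  case "$ip" in
    ''|*[!0-9.]*|.*|*.) return 1 ;;
  esac
  # Split the address into positional parameters on dots.
  old_ifs=$IFS
  IFS=.
  set -- $ip
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  # Each octet must be 1 to 3 digits and at most 255.
  for octet in "$@"; do
    case "$octet" in
      ''|????*) return 1 ;;
    esac
    [ "$octet" -le 255 ] || return 1
  done
  [ "$1" -eq 10 ] && return 0                                          # 10.0.0.0/8
  [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ] && return 0   # 172.16.0.0/12
  [ "$1" -eq 192 ] && [ "$2" -eq 168 ] && return 0                     # 192.168.0.0/16
  return 1
}

# Usage, with $ip holding the address returned by dig:
#   is_rfc1918 "$ip" && echo "private_resolution_ok=$ip"
```

A numeric comparison rejects addresses such as 172.200.1.1, which a loose glob on the second octet would accept as private.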

Read Kubernetes objects in dependency order

Inside the cluster, avoid jumping directly to pod logs. Read the route objects in the same order as traffic uses them: ingress, service, endpoints, pods and recent events.

bash 02-aks-ingress-service-endpoints.sh
NS=prod
INGRESS=orders-api
SERVICE=orders-api
APP_LABEL=orders-api

kubectl -n "$NS" describe ingress "$INGRESS"
kubectl -n "$NS" get ingress "$INGRESS" -o wide

kubectl -n "$NS" describe service "$SERVICE"
kubectl -n "$NS" get endpointslice -l kubernetes.io/service-name="$SERVICE" -o wide

kubectl -n "$NS" get pods -l app="$APP_LABEL" -o wide
kubectl -n "$NS" describe pods -l app="$APP_LABEL" | grep -Ei 'Ready|Readiness|Warning|Failed|Unhealthy|Back-off'

A service with no endpoints usually means labels or readiness are wrong. An ingress pointing to the wrong service means the controller configuration is wrong. Pods that are ready but return dependency errors move the diagnosis toward identity, DNS or downstream services.
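The two checks in that reading, "does the service have ready endpoints?" and "does the selector match the pod labels?", can be made mechanical. This is a hedged sketch: `ready_endpoint_count` and `selector_matches` are hypothetical helper names, and the kubectl invocations in the comments assume the default column layout of `-o wide` and `--show-labels`.

```shell
# Sketch only. Counts endpoints marked ready.
# Reads one "true" or "false" per line on stdin, for example from:
#   kubectl -n "$NS" get endpointslice -l kubernetes.io/service-name="$SERVICE" \
#     -o jsonpath='{range .items[*]}{range .endpoints[*]}{.conditions.ready}{"\n"}{end}{end}'
ready_endpoint_count() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0, drop the failure.
  grep -c '^true$' || true
}

# Sketch only. Succeeds when every key=value pair of the selector appears in the labels.
# Both arguments are comma-separated key=value lists, as printed in the SELECTOR column of
#   kubectl -n "$NS" get service "$SERVICE" -o wide
# and the LABELS column of
#   kubectl -n "$NS" get pods --show-labels
selector_matches() {
  selector=$1
  labels=$2
  [ -n "$selector" ] || return 1
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    # Wrap both sides in commas so "app=orders" does not match "app=orders-api".
    case ",$labels," in
      *",$pair,"*) ;;
      *) IFS=$old_ifs; echo "selector_mismatch=$pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}
```

A count of 0 together with a selector mismatch points at labels; a count of 0 with a matching selector points at readiness.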

Correlate ingress controller logs with pod logs

The ingress controller often contains the best clue: upstream unavailable, backend not found, TLS mismatch, timeout, path mismatch or no endpoints. Read it with the application logs over the same incident window.

kusto 03-aks-private-ingress-triage.kql
// Incident window and the identifiers to search for.
let Window = 2h;
let Namespace = "prod";
let Host = "api.internal.example.com";
// Controller side: adjust the namespace terms to where your ingress controller runs.
// ContainerLogV2 exposes the namespace as PodNamespace; LogMessage is dynamic, so cast it before string matching.
let ControllerLogs =
ContainerLogV2
| where TimeGenerated > ago(Window)
| where PodNamespace has_any ("ingress", "ingress-nginx", "appgw")
| where tostring(LogMessage) has_any (Host, "upstream", "no endpoints", "502", "timeout", "connect", "service")
| project TimeGenerated, Source="ingress-controller", PodName, PodNamespace, LogMessage;
// Application side: same window, workload namespace only.
let AppLogs =
ContainerLogV2
| where TimeGenerated > ago(Window)
| where PodNamespace == Namespace
| where tostring(LogMessage) has_any ("error", "failed", "timeout", "dependency", "ready", "health")
| project TimeGenerated, Source="application", PodName, PodNamespace, LogMessage;
// One timeline, newest first.
ControllerLogs
| union AppLogs
| order by TimeGenerated desc

The quick read is direct: controller logs without app logs point to routing, service or endpoints; app logs without controller errors point to workload or dependency behavior; no signal in either view sends the diagnosis back to gateway, DNS or the caller network.
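That quick read can be written down as a small decision helper for the runbook. This is a sketch under assumptions: `next_layer` is a hypothetical name, its two arguments are the number of rows returned for each `Source` value (for example by appending a `summarize count() by Source` step to the query), and the branch for "both sides have signal" is an addition of this sketch, since the text above does not cover it.

```shell
# Sketch only: maps log counts to the next layer to inspect.
# $1 = rows with Source == "ingress-controller", $2 = rows with Source == "application".
next_layer() {
  controller_hits=$1
  app_hits=$2
  if [ "$controller_hits" -gt 0 ] && [ "$app_hits" -eq 0 ]; then
    # Controller complains, workload is silent.
    echo "check=routing,service,endpoints"
  elif [ "$controller_hits" -eq 0 ] && [ "$app_hits" -gt 0 ]; then
    # Requests arrive, the workload or a dependency misbehaves.
    echo "check=workload,dependencies"
  elif [ "$controller_hits" -eq 0 ] && [ "$app_hits" -eq 0 ]; then
    # Nothing reached the cluster.
    echo "check=gateway,dns,caller-network"
  else
    # Assumption of this sketch: with signal on both sides, compare timestamps first.
    echo "check=correlate-timestamps"
  fi
}

# Usage: next_layer 14 0
```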

Keep rollback bounded

Rollback should target the layer that changed. If the ingress rule changed, revert the ingress. If the service selector changed, restore the selector. If a deployment created unready pods, roll back that deployment. Avoid a broad rollback that hides the broken layer.

bash 04-aks-bounded-rollback.sh
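NS=prod
DEPLOYMENT=orders-api
# Avoid the name HOSTNAME: bash already sets that variable to the local machine name.
TARGET_HOST=api.internal.example.com
CORRELATION_ID="ops-$(date +%Y%m%d%H%M%S)"

# Record the revision history first, then roll back to the previous revision and wait for it.
kubectl -n "$NS" rollout history deployment "$DEPLOYMENT"
kubectl -n "$NS" rollout undo deployment "$DEPLOYMENT"
kubectl -n "$NS" rollout status deployment "$DEPLOYMENT" --timeout=180s

# -k skips certificate verification; acceptable here only because TLS was already checked from the caller network.
curl -vk "https://$TARGET_HOST/health" -H "x-correlation-id: $CORRELATION_ID"
echo "validate_correlation_id=$CORRELATION_ID"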

If the rollback fixes the call, keep the evidence: previous replica set, image reference, ingress/service state, correlation ID and validation time. Without that trail, the next incident will restart from guesswork.
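One way to keep that trail is a single key=value line per validation. The sketch below is illustrative only: `format_evidence`, the field names and the `incident-evidence.log` file name are assumptions, not an established format.

```shell
# Sketch only: prints one evidence line with a UTC timestamp.
# Args: correlation id, image reference, previous ReplicaSet name, validation result.
format_evidence() {
  printf 'time=%s correlation_id=%s image=%s previous_replicaset=%s result=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4"
}

# Usage (the image can be read from the deployment spec):
#   image=$(kubectl -n "$NS" get deployment "$DEPLOYMENT" \
#     -o jsonpath='{.spec.template.spec.containers[*].image}')
#   format_evidence "$CORRELATION_ID" "$image" "<previous-replicaset>" health_ok >> incident-evidence.log
```

Appending to a file rather than overwriting it keeps earlier validations visible when the same incident needs more than one rollback attempt.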

Conclusion

Private AKS ingress is reliable only when the path is readable from the network edge to the pod. DNS, Application Gateway, ingress controller, Kubernetes service endpoints, readiness and application logs must be separated before any fix is chosen.

The practical rule is simple: do not change deployments until the request has reached the cluster, do not change WAF or gateway rules until the controller path is visible, and do not roll back an image when the service has no endpoints. With that discipline, a private AKS incident becomes a bounded diagnosis instead of a noisy deployment emergency.