Azure Service Bus: diagnose a private endpoint before touching queues
An operational runbook for Azure Service Bus private access incidents that separates DNS, Private Endpoint, identity, firewall, metrics, logs and rollback.
A private-access incident on Azure Service Bus rarely looks like a single failure. A publisher may see timeouts, a consumer may stop receiving messages, an application may return 401 or 403, and metrics may show a growing queue with no clear code-side error. When the namespace is exposed through a Private Endpoint, the right reaction is not to redeploy the worker or change the queue blindly.
The use case is an internal application that publishes or consumes messages from a private network: App Service, Function, AKS, Container Apps, VM or automation runner. The runbook goal is to prove whether the incident comes from the private path, identity, network rules, Service Bus configuration or application processing.
Read Service Bus as a messaging path
Service Bus adds a business layer to network diagnosis. Knowing whether the namespace responds is not enough. You also need the operation type, target entity, identity in use and effect on backlog.
Workload: publisher, consumer, Function trigger, worker or pipeline; FQDN resolution from the same network as the workload; identity or SAS actually used by the application.
Private DNS: namespace.servicebus.windows.net must resolve through a CNAME to namespace.privatelink.servicebus.windows.net; the final answer must be a private address from the expected VNet.
Service Bus platform: Private Endpoint approved; public network access aligned with the security target; queue, topic, subscription and dead-letter queue in the expected state.
Evidence: test timestamp; operation sent or received; ActiveMessages, DeadletteredMessages and IncomingRequests metrics; logs and application errors.
This view avoids two common shortcuts: increasing permissions while the name still resolves publicly, or purging a queue while the consumer simply no longer reaches the namespace.
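The Private DNS expectation above can be turned into a quick pass/fail check. The following is a minimal shell sketch, assuming the VNet uses RFC 1918 address space; is_private_ipv4 and resolution_verdict are hypothetical helper names introduced for illustration, not part of any Azure tooling.

```shell
# Return 0 when $1 is an RFC 1918 IPv4 address, 1 otherwise.
# Assumption: the Private Endpoint subnet uses RFC 1918 space.
is_private_ipv4() {
  # Accept only digits and dots so the unquoted split below cannot glob.
  case "$1" in
    ''|*[!0-9.]*) return 1 ;;
  esac
  _ipv4_old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its octets
  IFS=$_ipv4_old_ifs
  [ "$#" -eq 4 ] || return 1
  for _ipv4_octet in "$@"; do
    [ -n "$_ipv4_octet" ] || return 1
    [ "$_ipv4_octet" -le 255 ] || return 1
  done
  if [ "$1" -eq 10 ]; then return 0; fi
  if [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ]; then return 0; fi
  if [ "$1" -eq 192 ] && [ "$2" -eq 168 ]; then return 0; fi
  return 1
}

# Print "private" or "not-private" for the final DNS answer.
resolution_verdict() {
  if is_private_ipv4 "$1"; then
    echo "private"
  else
    echo "not-private"
  fi
}

# Possible usage from the workload network (last line of dig is the A record):
#   resolution_verdict "$(dig +short sb-prod-orders.servicebus.windows.net | tail -n 1)"
```

A VNet built on a different private range would need its prefix added to the function. The verdict only covers the final address, so the CNAME through the privatelink zone still has to be read from the full dig or nslookup output.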
Classify the symptom before acting
The error message must always be read together with the location the test ran from. Missing consumption can come from the network, a message lock, a subscription filter, an identity without the required role, or application processing that fails after receiving the message.
Observed symptom: what to check first
Timeout or unresolved name: check private DNS, Private Endpoint, routing and local firewall.
401 or 403: check the real identity, the Azure Service Bus Data Sender, Data Receiver or Data Owner role assignments, or the SAS.
Messages published but not consumed: check active consumer, subscription, filter, lock duration, dead-letter and application errors.
No Service Bus log line: go back to FQDN, public endpoint, source network or connection string.
Backlog grows after a Terraform change: compare Private Endpoint, publicNetworkAccess, role assignments and deployed entities.
The operating rule is stable: until the Service Bus FQDN resolves privately from the consumer network, a queue or code fix is premature.
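The symptom-to-check mapping above can be kept next to the runbook as a small lookup, so that the first action is always the same for a given symptom. This is only a sketch: first_check and the symptom keywords (timeout, not-consumed and so on) are hypothetical labels chosen for the example.

```shell
# Map an observed symptom to the layer the runbook inspects first.
# The keywords are illustrative labels, not values emitted by Azure.
first_check() {
  case "$1" in
    timeout|unresolved-name)
      echo "private DNS, Private Endpoint, routing, local firewall" ;;
    401|403)
      echo "runtime identity, Service Bus data roles or SAS" ;;
    not-consumed)
      echo "consumer, subscription, filter, lock duration, dead-letter, application errors" ;;
    no-log-line)
      echo "FQDN, public endpoint, source network, connection string" ;;
    backlog-after-terraform)
      echo "Private Endpoint, publicNetworkAccess, role assignments, deployed entities" ;;
    *)
      echo "unclassified: record test location and timestamp first" ;;
  esac
}

# Example: first_check 403
```

Keeping the mapping in one function makes it harder to skip straight to a queue or code change when the symptom still points at the network.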
Test from the right network
The test must run from a point that shares the workload DNS and network path. A local VPN workstation can help, but it does not replace a diagnostic VM, pod, private runner or application subnet.
RG=rg-prod-messaging
NAMESPACE=sb-prod-orders
QUEUE=orders-in
# HOSTNAME is a variable bash sets itself, so use a dedicated name
SB_FQDN="$NAMESPACE.servicebus.windows.net"
nslookup "$SB_FQDN"
dig +short "$SB_FQDN"
az servicebus namespace show -g "$RG" -n "$NAMESPACE" --query "{name:name, publicNetworkAccess:publicNetworkAccess, provisioningState:provisioningState}" -o jsonc
SB_ID=$(az servicebus namespace show -g "$RG" -n "$NAMESPACE" --query id -o tsv)
az network private-endpoint-connection list --id "$SB_ID" --query "[].{name:name,status:properties.privateLinkServiceConnectionState.status,description:properties.privateLinkServiceConnectionState.description}" -o table
az servicebus queue show -g "$RG" --namespace-name "$NAMESPACE" -n "$QUEUE" --query "{status:status, active:countDetails.activeMessageCount, deadletter:countDetails.deadLetterMessageCount, lockDuration:lockDuration}" -o jsonc
These commands do not prove the application consumes correctly, but they quickly separate an unavailable namespace, an unapproved Private Endpoint, inconsistent public access and a disabled or already saturated queue.
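Name resolution and control-plane queries still leave one question open: whether the workload network can open a TCP connection to the namespace. The sketch below is one way to test it, assuming bash and the coreutils timeout command are available on the test host; tcp_check is a hypothetical helper name. Service Bus clients use port 5671 for AMQP over TLS and port 443 for HTTPS and AMQP over WebSockets.

```shell
# Print "reachable" or "unreachable" for a TCP connection attempt.
# Relies on bash's /dev/tcp redirection and a 5 second timeout.
tcp_check() {
  if timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Possible usage from the diagnostic VM, pod or private runner:
#   SB_FQDN="sb-prod-orders.servicebus.windows.net"
#   for port in 5671 443; do
#     echo "$port: $(tcp_check "$SB_FQDN" "$port")"
#   done
```

A successful connection only shows that routing and firewall rules allow the path. It says nothing about TLS, identity or the state of the queue, so it complements the commands above rather than replacing them.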
Verify the real identity
In production, Service Bus is often consumed with a managed identity, service principal or legacy SAS. Diagnosis must identify the identity actually used by the runtime, not the one expected in the architecture.
SB_ID=$(az servicebus namespace show -g "$RG" -n "$NAMESPACE" --query id -o tsv)
PRINCIPAL_ID="<workload-principal-id>"  # placeholder: object ID of the identity the runtime really uses
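# --include-inherited also lists roles granted at resource group or subscription scope
az role assignment list --assignee "$PRINCIPAL_ID" --scope "$SB_ID" --include-inherited --query "[].{role:roleDefinitionName,scope:scope}" -o table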
az servicebus namespace authorization-rule list -g "$RG" --namespace-name "$NAMESPACE" -o table
If the workload still uses SAS, the runbook must make it visible. An expired SAS or one copied from an old namespace creates a very different incident from a managed identity missing the Azure Service Bus Data Receiver role.
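The role listing can be reduced to a yes or no answer for the operation that is failing. This is a minimal sketch: has_data_role is a hypothetical helper that reads role names on standard input, one per line, and treats Azure Service Bus Data Owner as covering both send and receive.

```shell
# Succeed when the role needed for the operation (send or receive),
# or Azure Service Bus Data Owner, appears in the role names on stdin.
has_data_role() {
  case "$1" in
    send)    _needed_role="Azure Service Bus Data Sender" ;;
    receive) _needed_role="Azure Service Bus Data Receiver" ;;
    *) echo "usage: has_data_role send|receive" >&2; return 2 ;;
  esac
  # -F fixed strings, -x whole-line match, -q quiet
  grep -Fxq -e "$_needed_role" -e "Azure Service Bus Data Owner"
}

# Possible usage, with PRINCIPAL_ID and SB_ID set as in the commands above:
#   az role assignment list --assignee "$PRINCIPAL_ID" --scope "$SB_ID" \
#     --include-inherited --query "[].roleDefinitionName" -o tsv \
#     | has_data_role receive && echo "role present" || echo "role missing"
```

A missing role in this check explains a 401 or 403 only if the workload really authenticates with that principal. A workload on SAS will fail or succeed regardless of the result.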
Correlate backlog, errors and denials
Metrics show the incident shape: growing backlog, rising dead-letter count, missing incoming requests or namespace-side errors. Logs and application traces then provide the cause.
let Window = 2h;
let Namespace = "sb-prod-orders";
AzureDiagnostics
| where TimeGenerated > ago(Window)
| where ResourceProvider == "MICROSOFT.SERVICEBUS"
| where Resource has Namespace
| where OperationName has_any ("Send", "Receive", "Complete", "Abandon", "DeadLetter", "RenewLock")
or ResultType !in ("Success", "Succeeded")
| project TimeGenerated, Resource, OperationName, ResultType, ResultDescription, ActivityId, CallerIPAddress, Identity=tostring(column_ifexists("Identity", ""))
| order by TimeGenerated desc
Quick read: no line for the test points back to DNS, Private Endpoint or connection string; 401/403 points to identity or SAS; errors after Receive point to lock, dead-letter or application processing.
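The incident shape described above can be read from a handful of backlog samples before opening the logs. The sketch below assumes one numeric sample per line, oldest first; backlog_trend is a hypothetical helper that only compares the first and last sample, so it is a coarse signal, not an alerting rule.

```shell
# Read numeric samples (oldest first, one per line) and report the trend
# between the first and the last sample.
backlog_trend() {
  awk '
    NF == 0 { next }                      # ignore blank lines
    count == 0 { first = $1 + 0 }         # remember the oldest sample
    { last = $1 + 0; count++ }
    END {
      if (count < 2) { print "insufficient-data"; exit }
      if (last > first) print "growing"
      else if (last < first) print "draining"
      else print "stable"
    }'
}

# Possible source of samples (assumption: verify the output shape first):
#   az monitor metrics list --resource "$SB_ID" --metric ActiveMessages \
#     --interval PT5M --aggregation Average \
#     --query "value[0].timeseries[0].data[].average" -o tsv | backlog_trend
```

A growing trend with no matching Receive activity in the logs points back to the private path or identity. A growing dead-letter count with normal Receive activity points to application processing.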
Choose a bounded rollback
Rollback should not erase evidence or hide the origin. It should restore the layer that changed: DNS, Private Endpoint, identity, queue configuration or application version.
Recent change
Private DNS or zone link
Rollback: restore the previous link or forwarding path
Evidence: FQDN resolves privately from the workload
Private Endpoint or publicNetworkAccess
Rollback: return to the previous network state
Evidence: message sent or received with visible Service Bus logs
Role assignment or SAS
Rollback: restore the previous identity or key only for the agreed window
Evidence: same operation succeeds without broader rights
Worker, Function or queue configuration
Rollback: return to the previous application version
Evidence: backlog stabilizes and dead-letter no longer grows
Conclusion
An Azure Service Bus incident on a private network should be handled as a complete chain: DNS, Private Endpoint, public exposure, identity, messaging entity, metrics, logs and rollback. This method avoids confusing a private path failure with a consumer bug, or an identity denial with a blocked queue.
The decision becomes simpler: fix DNS when the namespace resolves to a public address, the Private Endpoint when the private path is missing, RBAC or SAS when the call reaches Service Bus but fails, and the application only when platform evidence is clean.