
Azure Service Bus: diagnose a private endpoint before touching queues

An operational runbook for diagnosing Azure Service Bus private access incidents by separating DNS, Private Endpoint, identity, firewall, metrics, logs and rollback.

15 Jun 2026 · azure, service-bus, private-endpoint, dns, kql, logs, monitoring, runbook, identity, rollback, messaging

A private-access incident on Azure Service Bus rarely looks like a single failure. A publisher may see timeouts, a consumer may stop receiving messages, an application may return 401 or 403, and metrics may show a growing queue with no clear code-side error. When the namespace is exposed through a Private Endpoint, the right reaction is not to redeploy the worker or change the queue blindly.

The use case is an internal application that publishes or consumes messages from a private network: App Service, Functions, AKS, Container Apps, a VM or an automation runner. The goal of the runbook is to prove whether the incident comes from the private path, identity, network rules, Service Bus configuration or application processing.

Read Service Bus as a messaging path

Service Bus adds a business layer to network diagnosis. Knowing whether the namespace responds is not enough. You also need the operation type, target entity, identity in use and effect on backlog.

text service-bus-private-path.txt

Workload
Publisher, consumer, Function trigger, worker or pipeline
FQDN resolution from the same network as the workload
Identity or SAS actually used by the application

Private DNS
namespace.servicebus.windows.net must resolve through a CNAME to namespace.privatelink.servicebus.windows.net
The final answer must be a private address from the expected VNet

Service Bus platform
Private Endpoint approved
Public network access aligned with the security target
Queue, topic, subscription and dead-letter queue in the expected state

Evidence
Test timestamp
Operation sent or received
ActiveMessages, DeadletteredMessages, IncomingRequests metrics
Logs and application errors

This view avoids two common shortcuts: increasing permissions while the name still resolves publicly, or purging a queue while the consumer simply no longer reaches the namespace.
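The Private DNS expectation above can be turned into a small check. The following is a minimal sketch, not a definitive implementation: the function names `is_private_ipv4` and `classify_resolution` are hypothetical, "private" is assumed to mean an RFC 1918 IPv4 address, and the input is assumed to be the output of `dig +short` run from the workload network.

```bash
#!/usr/bin/env bash
# Sketch: classify a DNS answer for a Service Bus FQDN.
# Input on stdin: output of `dig +short <fqdn>` (CNAME targets first, then A records).

# Return 0 when the IPv4 address falls in an RFC 1918 private range.
is_private_ipv4() {
  local a b
  IFS=. read -r a b _ <<<"$1"
  case $a in
    10)  return 0 ;;
    172) [ "$b" -ge 16 ] && [ "$b" -le 31 ] ;;
    192) [ "$b" -eq 168 ] ;;
    *)   return 1 ;;
  esac
}

# Print one of: unresolved, private, private-without-privatelink, public.
classify_resolution() {
  local line via_privatelink=no last_ip=""
  while IFS= read -r line; do
    line=${line%.}   # dig prints CNAME targets with a trailing dot
    case $line in
      *.privatelink.servicebus.windows.net) via_privatelink=yes ;;
    esac
    if [[ $line =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
      last_ip=$line
    fi
  done
  if [ -z "$last_ip" ]; then
    echo "unresolved"
  elif [ "$via_privatelink" = yes ] && is_private_ipv4 "$last_ip"; then
    echo "private"
  elif is_private_ipv4 "$last_ip"; then
    echo "private-without-privatelink"   # for example a hosts file or custom record
  else
    echo "public"                        # privatelink may appear in the chain and still end publicly
  fi
}
```

From the workload network this would be used as `dig +short "$NAMESPACE.servicebus.windows.net" | classify_resolution`. Any result other than `private` suggests staying on the DNS layer before looking at identity or queues.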

Classify the symptom before acting

The error message must always be read together with the location the test ran from. Missing consumption can come from the network, a message lock, a subscription filter, an identity without a role assignment, or application processing that fails after receiving the message.

text service-bus-symptoms.txt

Observed symptom
Timeout or unresolved name
  Check private DNS, Private Endpoint, routing and local firewall

401 or 403
  Check the real identity, its Azure Service Bus Data Sender, Receiver or Owner role assignment, or the SAS key in use

Messages published but not consumed
  Check active consumer, subscription, filter, lock duration, dead-letter and application errors

No Service Bus log line
  Go back to FQDN, public endpoint, source network or connection string

Backlog grows after Terraform change
  Compare Private Endpoint, publicNetworkAccess, role assignments and deployed entities

The operating rule is stable: until the Service Bus FQDN resolves privately from the consumer network, a queue or code fix is premature.

Test from the right network

The test must run from a point that shares the workload DNS and network path. A local VPN workstation can help, but it does not replace a diagnostic VM, pod, private runner or application subnet.

bash 01-service-bus-private-check.sh

RG=rg-prod-messaging
NAMESPACE=sb-prod-orders
QUEUE=orders-in
# Avoid the name HOSTNAME: bash already sets that variable to the local machine name
SB_FQDN="$NAMESPACE.servicebus.windows.net"

# 1. DNS: the final answer should be a private address reached through the privatelink zone
nslookup "$SB_FQDN"
dig +short "$SB_FQDN"

# 2. Namespace state and public exposure
az servicebus namespace show -g "$RG" -n "$NAMESPACE" --query "{name:name, publicNetworkAccess:publicNetworkAccess, provisioningState:provisioningState}" -o jsonc

# 3. Private Endpoint connections: the status should be Approved
SB_ID=$(az servicebus namespace show -g "$RG" -n "$NAMESPACE" --query id -o tsv)

az network private-endpoint-connection list --id "$SB_ID" --query "[].{name:name,status:privateLinkServiceConnectionState.status,description:privateLinkServiceConnectionState.description}" -o table

# 4. Queue status, backlog, dead-letter count and lock duration
az servicebus queue show -g "$RG" --namespace-name "$NAMESPACE" -n "$QUEUE" --query "{status:status, active:countDetails.activeMessageCount, deadletter:countDetails.deadLetterMessageCount, lockDuration:lockDuration}" -o jsonc

These commands do not prove the application consumes correctly, but they quickly separate an unavailable namespace, unapproved Private Endpoint, inconsistent public access and a disabled or already saturated queue.
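Name resolution and control-plane state still leave one gap: whether a TCP connection can be opened from the workload network to the resolved address. A minimal sketch follows, assuming bash with `/dev/tcp` support and the coreutils `timeout` command; `check_port` is a hypothetical helper name. Port 5671 (AMQP over TLS) and port 443 (HTTPS and AMQP over WebSockets) are the ports Service Bus clients normally use.

```bash
#!/usr/bin/env bash
# Sketch: test TCP reachability of the namespace from the workload network.
# A successful connect proves the network path only, not TLS or authorization.

# check_port <host> <port> [timeout-seconds]
check_port() {
  local host=$1 port=$2 wait_s=${3:-5}
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>
  if timeout "$wait_s" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open   $host:$port"
  else
    echo "closed $host:$port (refused, filtered or timed out)"
    return 1
  fi
}

# Example, to run from the workload network:
#   check_port "$NAMESPACE.servicebus.windows.net" 5671
#   check_port "$NAMESPACE.servicebus.windows.net" 443
```

If the name resolves privately but both ports are closed, routing, NSG rules or a local firewall are more likely candidates than Service Bus itself.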

Verify the real identity

In production, Service Bus is often consumed with a managed identity, service principal or legacy SAS. Diagnosis must identify the identity actually used by the runtime, not the one expected in the architecture.

bash 02-service-bus-identity-check.sh

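RG=rg-prod-messaging
NAMESPACE=sb-prod-orders
SB_ID=$(az servicebus namespace show -g "$RG" -n "$NAMESPACE" --query id -o tsv)
# Replace with the object ID of the identity the workload really runs as
PRINCIPAL_ID="<workload-principal-id>"

# Role assignments visible at namespace scope, including those inherited from parent scopes
az role assignment list --assignee "$PRINCIPAL_ID" --scope "$SB_ID" --include-inherited --query "[].{role:roleDefinitionName,scope:scope}" -o table

# SAS authorization rules still defined on the namespace
az servicebus namespace authorization-rule list -g "$RG" --namespace-name "$NAMESPACE" -o table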

If the workload still uses SAS, the runbook must make it visible. An expired SAS, or one copied from an old namespace, creates a very different incident from a managed identity that is missing the Azure Service Bus Data Receiver role.
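One way to see which identity the runtime actually presents is to request a token for the Service Bus audience from inside the workload and read its claims. The sketch below assumes the Azure CLI is installed there and signed in as the workload identity (for example with `az login --identity`); `jwt_payload` is a hypothetical helper that decodes the token payload without verifying the signature, which is enough for diagnosis but not for trust decisions.

```bash
#!/usr/bin/env bash
# Sketch: print the claims of a JWT access token (no signature verification).

jwt_payload() {
  local payload=${1#*.}    # drop the header segment
  payload=${payload%%.*}   # drop the signature segment
  # base64url -> base64: '_' becomes '/', '-' becomes '+'
  payload=$(printf '%s' "$payload" | tr '_-' '/+')
  # restore the padding that base64url omits
  case $(( ${#payload} % 4 )) in
    2) payload+="==" ;;
    3) payload+="=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Example, to run inside the workload:
#   TOKEN=$(az account get-access-token --resource https://servicebus.azure.net --query accessToken -o tsv)
#   jwt_payload "$TOKEN"
```

In the decoded JSON, the `oid` claim is normally the object ID to compare with `PRINCIPAL_ID`, and `aud` shows the audience the token was issued for. The exact set of claims depends on the token version, so treat the names as something to confirm rather than assume.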

Correlate backlog, errors and denials

Metrics show the incident shape: growing backlog, rising dead-letter count, missing incoming requests or namespace-side errors. Logs and application traces then provide the cause.

kusto 03-service-bus-private-errors.kql

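let Window = 2h;
let Namespace = "sb-prod-orders";
AzureDiagnostics
| where TimeGenerated > ago(Window)
| where ResourceProvider == "MICROSOFT.SERVICEBUS"
| where Resource has Namespace
// Keep messaging operations, plus any operation that did not succeed
| where OperationName has_any ("Send", "Receive", "Complete", "Abandon", "DeadLetter", "RenewLock")
    or ResultType !in ("Success", "Succeeded")
// ActivityId and Identity are not present in every workspace schema; column_ifexists avoids a query failure
| project TimeGenerated, Resource, OperationName, ResultType, ResultDescription, CallerIPAddress,
    ActivityId = tostring(column_ifexists("ActivityId", "")),
    Identity = tostring(column_ifexists("Identity", ""))
| order by TimeGenerated desc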

Quick read: no log line matching the test timestamp points back to DNS, Private Endpoint or connection string; 401/403 points to identity or SAS; errors after Receive point to lock, dead-letter or application processing.
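The metric side of the correlation can be reduced to a simple trend over recent samples. This is a rough sketch under stated assumptions: `backlog_trend` is a hypothetical helper, it only compares the first and last usable samples, and the samples are assumed to come from the `ActiveMessages` metric in chronological order.

```bash
#!/usr/bin/env bash
# Sketch: describe the shape of a backlog from metric samples, oldest first.

# backlog_trend <sample> <sample> ...  prints growing, draining, stable or insufficient-data
backlog_trend() {
  local sample first="" last="" count=0
  for sample in "$@"; do
    sample=${sample%.*}                      # metrics come back as floats, e.g. 12.0
    [[ $sample =~ ^[0-9]+$ ]] || continue    # skip empty or null samples
    : "${first:=$sample}"                    # remember the first usable sample
    last=$sample
    count=$((count + 1))
  done
  if [ "$count" -lt 2 ]; then
    echo "insufficient-data"
  elif [ "$last" -gt "$first" ]; then
    echo "growing"
  elif [ "$last" -lt "$first" ]; then
    echo "draining"
  else
    echo "stable"
  fi
}

# Example, assuming SB_ID holds the namespace resource ID:
#   backlog_trend $(az monitor metrics list --resource "$SB_ID" --metric ActiveMessages \
#     --aggregation Average --query "value[0].timeseries[0].data[].average" -o tsv)
```

A `growing` result with no matching Send or Receive errors in the logs is consistent with a consumer that no longer reaches the namespace, which is the case where purging the queue would destroy evidence.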

Choose a bounded rollback

Rollback should not erase evidence or hide the origin. It should restore the layer that changed: DNS, Private Endpoint, identity, queue configuration or application version.

text service-bus-rollback.txt

Recent change
Private DNS or zone link
  Rollback: restore the previous link or forwarding path
  Evidence: FQDN resolves privately from the workload

Private Endpoint or publicNetworkAccess
  Rollback: return to the previous network state
  Evidence: message sent or received with visible Service Bus logs

Role assignment or SAS
  Rollback: restore the previous identity or key only for the agreed window
  Evidence: same operation succeeds without broader rights

Worker, Function or queue configuration
  Rollback: return to the previous application version or queue settings
  Evidence: backlog stabilizes and dead-letter no longer grows
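Because a rollback changes the very state being diagnosed, it can help to save command output before acting. The following is a minimal sketch: `capture` and `EVIDENCE_DIR` are hypothetical names, and the commands shown in the comments are only examples of what could be recorded.

```bash
#!/usr/bin/env bash
# Sketch: save diagnostic output with a timestamp before a rollback.

# Default to a timestamped folder unless the caller has set EVIDENCE_DIR
EVIDENCE_DIR=${EVIDENCE_DIR:-./evidence-$(date -u +%Y%m%dT%H%M%SZ)}

# capture <label> <command...>
# Runs the command, stores stdout and stderr in <label>.txt and logs the exit code.
capture() {
  local label=$1 rc=0
  shift
  mkdir -p "$EVIDENCE_DIR"
  "$@" >"$EVIDENCE_DIR/$label.txt" 2>&1 || rc=$?
  printf '%s\t%s\texit=%s\n' "$(date -u +%FT%TZ)" "$label" "$rc" >>"$EVIDENCE_DIR/index.tsv"
  return 0   # a failing diagnostic is evidence too, so do not stop the run
}

# Example, using the variables from the earlier checks:
#   capture dns       dig +short "$NAMESPACE.servicebus.windows.net"
#   capture namespace az servicebus namespace show -g "$RG" -n "$NAMESPACE" -o json
#   capture roles     az role assignment list --assignee "$PRINCIPAL_ID" --scope "$SB_ID" -o json
```

Running the same captures again after the rollback gives a before and after pair for each layer, which keeps the origin of the incident visible even once service is restored.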

Conclusion

An Azure Service Bus incident on a private network should be handled as a complete chain: DNS, Private Endpoint, public exposure, identity, messaging entity, metrics, logs and rollback. This method avoids confusing a private path failure with a consumer bug, or an identity denial with a blocked queue.

The decision becomes simpler: fix DNS when the namespace resolves to a public address, fix the Private Endpoint when the private path is missing, fix RBAC or SAS when the call reaches Service Bus but is denied, and touch the application only when the platform evidence is clean.