Troubleshooting an Azure Linux VM that cannot join Active Directory
A practical diagnostic procedure for an Azure Linux VM that fails to join Active Directory, with DNS, routing, ports, Kerberos, realmd, and SSSD checks.
An Azure Linux VM that cannot join Active Directory should not be diagnosed by retrying realm join with random options. In an Azure spoke, the failure can come from VNet DNS, a route to the hub, a firewall, an NSG, the wrong resolver, clock drift, the join account, or incomplete SSSD configuration. The most reliable method is to follow the real chain in order.
The scenario is straightforward. A Linux VM is deployed in Azure. Active Directory is located in a hybrid network or in a hub. The VM must join the domain so that AD groups can be used for system administration. The join fails with a Kerberos error, LDAP error, domain not found message, or timeout. The goal is to identify where the chain breaks before changing local configuration.
1. Check what the VM actually uses
The first step is to collect the resolver, routes, and hostname seen by the operating system. In Azure, VNet settings, cloud-init, NetworkManager, or systemd-resolved can produce a result that differs from what was expected.
ip route show
cat /etc/resolv.conf
hostname -f || true
If the VM uses 168.63.129.16, it relies on Azure-provided DNS. That is not automatically wrong, but internal AD resolution must then be designed elsewhere. If it uses private IP addresses, those IPs must match the intended internal DNS servers or forwarders.
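As a rough aid for reading the resolver output, the sketch below labels each nameserver entry. The function name classify_resolvers is illustrative and not part of any tool. It assumes that 127.0.0.53 is the systemd-resolved stub listener, in which case the real upstream servers have to be read from resolvectl status rather than from /etc/resolv.conf.

```shell
# classify_resolvers [FILE]: print "IP label" for every nameserver entry.
# Hypothetical helper; the labels are only a reading aid.
classify_resolvers() {
  local file="${1:-/etc/resolv.conf}"
  awk '$1 == "nameserver" {
    if ($2 == "168.63.129.16") { label = "azure-provided" }
    else if ($2 == "127.0.0.53") { label = "systemd-resolved-stub" }
    else { label = "custom" }
    print $2, label
  }' "$file"
}
```

Running classify_resolvers with no argument reads /etc/resolv.conf. Any address labelled custom still has to be compared by hand with the intended internal DNS servers or forwarders.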
2. Validate Active Directory DNS
A domain join depends on SRV records. Resolving only the domain name or one domain controller is not enough.
DNS="10.10.0.10"
DOMAIN="corp.example.local"
nslookup -type=SRV "_ldap._tcp.$DOMAIN" "$DNS"
nslookup -type=SRV "_kerberos._tcp.$DOMAIN" "$DNS"
If SRV records do not answer, there is no point retrying the join. The problem is DNS first. The correction may involve the VNet resolver, a conditional forwarder, an Azure DNS Private Resolver rule, or domain controller DNS registration.
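When the SRV records do answer, it helps to turn them into a list of host:port targets for the later connectivity checks. The sketch below assumes dig is installed (the article itself uses nslookup) and that dig +short prints SRV answers as "priority weight port target". The function name srv_targets is hypothetical.

```shell
# srv_targets: read "priority weight port target" lines on stdin
# (the shape of `dig +short SRV ...` output) and print "target:port".
# The trailing dot of the fully qualified target is removed.
srv_targets() {
  awk 'NF == 4 { sub(/\.$/, "", $4); print $4 ":" $3 }'
}

# Example call, not executed here (DOMAIN and DNS as defined in the article):
#   dig +short SRV "_ldap._tcp.$DOMAIN" "@$DNS" | srv_targets
```

An empty result from this pipeline means the same thing as an empty nslookup answer: the DNS path has to be fixed before anything else.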
3. Test the required network paths
When DNS works, connectivity to a domain controller must be checked. The first ports to validate are DNS (53), Kerberos (88), LDAP (389), SMB (445), and Kerberos password change (464).
DC="dc01.corp.example.local"
nc -vz -w 5 "$DC" 53
nc -vz -w 5 "$DC" 88
nc -vz -w 5 "$DC" 389
nc -vz -w 5 "$DC" 445
nc -vz -w 5 "$DC" 464
These checks cover TCP only; DNS and Kerberos also use UDP on the same ports. A timeout often points to a missing route, UDR, NSG, hub firewall, or asymmetric return path. An immediate refusal points more toward a closed service or explicit filtering.
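The timeout versus refusal distinction can be made explicit in a small wrapper. This is a sketch under assumptions: it relies on the GNU coreutils timeout command (exit status 124 on expiry) and on bash's /dev/tcp redirection, and the function names probe_port and classify_probe are illustrative. It only tests TCP.

```shell
# classify_probe RC: translate the exit status of the probe into a label.
classify_probe() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timeout" ;;           # GNU timeout expired: likely route, NSG, or firewall drop
    *)   echo "refused-or-error" ;;  # immediate failure: closed port, reject, or name error
  esac
}

# probe_port HOST PORT [SECONDS]: attempt a TCP connection and print a label.
probe_port() {
  local host="$1" port="$2" wait="${3:-3}" rc=0
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null || rc=$?
  classify_probe "$rc"
}

# Example loop, not executed here (DC as defined in the article):
#   for p in 53 88 389 445 464; do echo "$p $(probe_port "$DC" "$p")"; done
```

A column of timeout results for every port suggests a path problem between the spoke and the domain controller, while a single refused-or-error entry among open ports suggests one service or one rule.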
4. Check time before Kerberos
Kerberos does not tolerate significant clock drift. A VM with an incorrect time can return authentication errors even when the account is valid.
timedatectl status
chronyc tracking 2>/dev/null || true
If time is wrong, fix synchronization before continuing. Changing SSSD in that state only adds noise.
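To turn the chrony output into a yes or no answer, the offset can be compared with the Kerberos tolerance. The sketch below assumes the common default maximum skew of 300 seconds (5 minutes), which a domain policy may change, and assumes the "System time" line format printed by chronyc tracking. Both function names are hypothetical. Note that chrony reports the offset from its own NTP source, which is only meaningful for Kerberos if that source agrees with the domain controllers.

```shell
# chrony_offset_seconds: read `chronyc tracking` output on stdin and print
# the absolute offset from the "System time" line, in seconds.
chrony_offset_seconds() {
  awk -F': *' '/^System time/ { split($2, a, " "); print a[1]; exit }'
}

# within_kerberos_skew OFFSET [MAX]: succeed if |OFFSET| is below MAX seconds.
# MAX defaults to 300, an assumed default that policy can override.
within_kerberos_skew() {
  local offset="$1" max="${2:-300}"
  awk -v o="$offset" -v m="$max" 'BEGIN { if (o < 0) o = -o; exit !(o < m) }'
}

# Example call, not executed here:
#   within_kerberos_skew "$(chronyc tracking | chrony_offset_seconds)" || echo "fix time first"
```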
5. Test Kerberos without realmd
Before joining, kinit can confirm whether the VM can obtain a Kerberos ticket. This is more precise than realm join.
REALM="CORP.EXAMPLE.LOCAL"
JOIN_USER="join-user"
kinit "$JOIN_USER@$REALM"
klist
kdestroy
A "Cannot find KDC" error often points to DNS SRV records or routing. A preauthentication failure points to the account, password, or Kerberos policy. A clock skew error points to time.
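The mapping from kinit error to likely cause can be written down as a lookup. This sketch matches on message fragments as MIT Kerberos typically prints them ("Cannot find KDC", "Preauthentication failed", "Clock skew"); the exact wording may differ between versions and locales, and the function name kinit_hint is hypothetical.

```shell
# kinit_hint MESSAGE: map a kinit error message to the layer to inspect next.
kinit_hint() {
  case "$1" in
    *"Cannot find KDC"*)          echo "dns-or-routing" ;;
    *"Preauthentication failed"*) echo "account-or-password" ;;
    *"Clock skew"*)               echo "time" ;;
    *)                            echo "unknown" ;;
  esac
}

# Example call, not executed here; PRINCIPAL is a placeholder such as join-user@CORP.EXAMPLE.LOCAL:
#   msg=$(kinit "$PRINCIPAL" 2>&1 >/dev/null) || kinit_hint "$msg"
```

An unknown result is a reason to read the full message, not a diagnosis in itself.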
6. Validate realmd and domain discovery
Once DNS, ports, and Kerberos are validated, it becomes worthwhile to inspect the join tools themselves.
DOMAIN="corp.example.local"
realm discover "$DOMAIN"
adcli info "$DOMAIN"
If adcli info works but realm discover fails, check packages and realmd behavior. If both fail while DNS and Kerberos are good, inspect LDAP and distribution dependencies.
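For the package check, realm discover usually lists what it expects on lines starting with "required-package:". The sketch below extracts those names so they can be compared with what is installed; it assumes that output format, and the function name required_packages is hypothetical. How to query installed packages depends on the distribution and is left out.

```shell
# required_packages: read `realm discover` output on stdin and print one
# package name per "required-package:" line.
required_packages() {
  awk '$1 == "required-package:" { print $2 }'
}

# Example call, not executed here (DOMAIN as defined in the article):
#   realm discover "$DOMAIN" | required_packages
```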
7. Run and validate the join
The join should use a dedicated account and, when available, the OU intended for Linux machines. After a successful join, validate that the domain appears, SSSD is coherent, and AD users and groups resolve.
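A minimal sketch of the join step itself, assuming realmd's realm join with its --user and --computer-ou options. The wrapper name join_domain, the DRY_RUN switch, and the OU distinguished name in the example are illustrative, not values from a real environment. The actual join needs root privileges and prompts for the account password.

```shell
# join_domain DOMAIN USER [OU_DN]: build and run `realm join`.
# With DRY_RUN=1 the command is printed instead of executed.
join_domain() {
  local domain="$1" user="$2" ou="${3:-}"
  set -- realm join --user="$user"
  if [ -n "$ou" ]; then
    set -- "$@" --computer-ou="$ou"   # place the computer object in the intended OU
  fi
  set -- "$@" "$domain"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Example call, not executed here; the OU is a hypothetical placeholder:
#   DRY_RUN=1 join_domain corp.example.local join-user "OU=Linux,DC=corp,DC=example,DC=local"
```

Printing the command first makes it easy to review the account and OU before anything is written to the directory.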
DOMAIN="corp.example.local"
TEST_USER="user@$DOMAIN"
TEST_GROUP="linux-admins@$DOMAIN"
realm list
sssctl config-check
sssctl domain-status $DOMAIN
getent passwd "$TEST_USER"
getent group "$TEST_GROUP"
If realm list is correct but getent returns nothing, the issue is likely in SSSD, NSS, name format, or group resolution. If groups resolve but final access fails, inspect PAM, SSH configuration, allowed groups, and home directory creation.
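When getent returns nothing, one quick NSS check is whether the sss source is listed for the relevant databases. The sketch below reads /etc/nsswitch.conf by default; the function name nss_uses_sss is hypothetical, and the check says nothing about whether SSSD itself is healthy.

```shell
# nss_uses_sss DATABASE [FILE]: succeed if the nsswitch database line
# (for example "passwd" or "group") lists the sss source.
nss_uses_sss() {
  local db="$1" file="${2:-/etc/nsswitch.conf}"
  awk -v db="$db:" '
    $1 == db { for (i = 2; i <= NF; i++) if ($i == "sss") found = 1 }
    END { exit !found }
  ' "$file"
}

# Example calls, not executed here:
#   nss_uses_sss passwd || echo "passwd does not consult sss"
#   nss_uses_sss group  || echo "group does not consult sss"
```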
Conclusion
The useful sequence is stable: local context, SRV records, AD ports, time, Kerberos, domain discovery, join, then SSSD validation. Following this order avoids random changes on the VM when the cause often sits in Azure, the network hub, or internal DNS. A realm join error is only the visible symptom; the useful diagnosis is the one that identifies the broken link precisely.