Linux-AD Integration Issue: Persistent Connections

A reader contacted me a short while ago to inquire about a problem he was having with his Linux-AD integration efforts. It seems he had recently added a new domain controller (DC) that was intended to be a DC for a disaster recovery (DR) site. When he took this new DR DC offline in order to physically move it to the DR site, some of his AD-integrated Linux systems started failing to authenticate. More specifically, Kerberos continued to work, but LDAP lookups failed. When the reader would bring the DR DC back online, those systems started working again.

There was a clear correlation between the DR DC's availability and the failures on the AD-integrated Linux systems, even though the /etc/ldap.conf file on those systems specifically pointed to another DC by IP address. There was no reference whatsoever, by IP address or host name, to the DR DC. Yet every time the DR DC was taken offline, the behavior returned on a subset of Linux hosts. The only difference we could find between the affected and unaffected hosts was that the affected hosts were not on the same VLAN as the production domain controllers.

I theorized that Windows’ netmask ordering feature, which sorts DNS query results so that clients receive addresses that are “closer” to them, was playing a role here. However, the /etc/ldap.conf file was using IP addresses, not the domain name or even the fully qualified domain name of a DC. It couldn’t be DNS, at least not as far as I could tell.

Upon further investigation, the reader discovered that the affected Linux servers (those on a different VLAN than both the production DCs and the DR DC) were maintaining persistent connections to the DR DC. (He found this via netstat.) When the DR DC went offline, the affected Linux hosts kept trying to communicate with that DC, and only that DC. Once the reader was able to get the affected Linux hosts to drop that persistent connection, he could take the DR DC offline and the Linux hosts worked as expected.
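If you want to check for this condition yourself, a command along these lines (run as root so that the owning process is reported) will show any established connections to the usual AD LDAP ports; a connection to a DC other than the one listed in /etc/ldap.conf is the telltale sign:

    netstat -tnp | grep -E ':(389|636|3268|3269)'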

So, the real question now becomes: how (or why) did the Linux servers connect to the DR DC instead of the production DC for which they were configured? I think that Active Directory issued an LDAP referral to direct the affected Linux servers to the DR DC as a result of site topology. Perhaps due to an incorrect or incomplete site topology configuration, Active Directory believed the DR DC should handle the VLANs where the affected Linux servers resided. If that is indeed the case, the fix would be to make sure that your AD site topology is correct and that subnets are appropriately associated with sites. Of course, this is just a theory.
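If you want to verify the site topology from the Linux side, a query along the following lines, with the DC name, bind account, and domain treated as placeholders for your own environment, should list each subnet object defined in the configuration partition along with the site to which it is assigned:

    ldapsearch -x -H ldap://dc1.example.com -W -D 'svc_ldap@example.com' \
        -b 'CN=Subnets,CN=Sites,CN=Configuration,DC=example,DC=com' \
        '(objectClass=subnet)' cn siteObject

Any client subnet that is missing from the results, or that is mapped to the wrong site, is a candidate for exactly the sort of misdirection described above.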

Has anyone else seen an issue similar to this? What fix were you able to implement in order to correct it?

7 comments

  1. Brian Desmond

    AD won’t issue a same-domain referral for closeness. You’ll get a referral if, say, you contact a DC in DomainA looking for DomainB.

  2. AC

    I agree with your hypothesis; the subnets need to be defined properly in Sites and Services.

  3. David Magda

    The best way to figure this out would be to get a test client and run tcpdump capturing everything from the LDAP connection (assumes no SSL/TLS). Once you know what AD is telling the Linux system, you can then formulate hypotheses on what Linux is doing with that data.
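    Something like this, with the interface name as a placeholder, would capture the LDAP traffic to a file for later inspection in Wireshark:

    tcpdump -i eth0 -s 0 -w ldap-debug.pcap 'port 389 or port 3268'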

    What process is/was keeping the persistent connection? Was it nscd? (Personally, I tend to turn that service off on Linux; it causes no end of confusion.)

  4. Stephen P. Schaefer

    Wireshark was showing me that every time I got a result (searchResEntry) I’d get three references:

    Item: ldap://ForestDnsZones.internal.example.com/DC=ForestDnsZones,DC=internal,DC=example,DC=com

    Item: ldap://DomainDnsZones.internal.example.com/DC=DomainDnsZones,DC=internal,DC=example,DC=com

    Item: ldap://internal.example.com/CN=Configuration,DC=internal,DC=example,DC=com

    A DNS A record lookup on each of ForestDnsZones, DomainDnsZones, and internal.example.com would return the IP address of every domain controller we have EVER had – including the dead ones and the ones on the wrong side of the planet (TCP latency!). I suspect that the Microsoft libraries do something smart with service (SRV) records and AD sites so that they don’t have the same troubles my Unix boxes have, but the Unix LDAP libraries don’t know anything about that.

    I’ve done two things: put “referrals off” in /etc/ldap.conf, and put a pen proxy server on every LDAP client, along with this entry in /etc/hosts:

    127.0.0.1 ForestDnsZones.internal.example.com, DomainDnsZones.internal.example.com, and internal.example.com

    pen does a good job of detecting LDAP server failure and failing over to something more appropriate. It needs to be told who the closest LDAP servers are, but I’ve got a Perl script to generate that list based on the number of hops discovered with traceroute :-).
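    For reference, the pen invocation itself is tiny; something along these lines (the DC names are placeholders) listens on the loopback LDAP port and forwards to whichever of the listed DCs is responding:

    pen 127.0.0.1:389 dc1.internal.example.com:389 dc2.internal.example.com:389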

  5. Stephen P. Schaefer

    Oops: that /etc/hosts line doesn’t include commas or “and” ;-).

  6. David Magda

    One thing that I’ve noticed is that if you use the standard LDAP port, AD isn’t really giving you “true” LDAP results; you get referrals off into the weeds (specifically for the “CN=Configuration” partition).

    If you use the “Global Catalog” port (tcp/3268), then you get back results that most other people would call “normal LDAP”. (Port 3269 is the SSL version.)

    Had this issue with an Apache server once: by default it would try to reach all the referrals, which I didn’t notice at first because things generally worked. However, one day a DC died, and Apache would hang waiting for it (until the timeout). This of course caused end-user problems because their web logins would take forever to complete. The solution was to use the Global Catalog port, and referrals were no longer given back as part of a standard search/bind operation.
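    As a concrete illustration (the server name, bind account, and base DN are placeholders), the only client-side difference is the port in the LDAP URI:

    ldapsearch -x -H ldap://dc1.example.com:3268 -W -D 'svc_ldap@example.com' \
        -b 'dc=example,dc=com' '(sAMAccountName=jdoe)'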

  7. Slav Pidgorny

    Stephen P. Schaefer’s analysis is almost right on target: what is observed here is not a result of LDAP referrals but a result of DNS lookups. Therefore I’d add DNS lookup traffic to the capture.

    Maybe not directly related (unless Linux makes use of SRV records), but anyway: it is important to prevent excessive DNS registration of domain controllers, especially in infrastructures with many domain controllers. That’s done using the “DnsAvoidRegisterRecords” registry setting; see http://support.microsoft.com/kb/306602.
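    The value lives under the Netlogon parameters key; for example (the exact mnemonics to list depend on which records you want to suppress; the KB article has the full table):

    HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters
        DnsAvoidRegisterRecords (REG_MULTI_SZ) = LdapIpAddress
                                                 GcIpAddress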
