Problems with Redundant dhcpd/dns (part 2)

As I mentioned last time, I ran into some incredibly annoying problems setting up a dhcpd failover server.

1. Dhcpd doesn’t record full data for foreign leases.

At first I thought it would be enough to run dhcpd with a failover peer and use djb_update.pl to create the host list for tinydns.  But then I discovered that when the dhcpd master grants a lease, the update it sends to the dhcpd slave apparently contains only the client’s IP address and ethernet address, not its hostname.  So it’s not possible to rely on the dhcpd failover protocol to keep the name-IP mappings synchronized on the dhcpd servers.

2. Dhcpd servers race to respond.

Like little virtual fire engines and paramedics, the dhcpd servers race to answer a poor IPless host’s cry for help.  It took me a while to see this one because taurus and draco were initially on the same segment.  Then we did a server move and got hung up halfway, with taurus in its new location but draco still in the old one.  Then all of a sudden a bunch of hosts disappeared from dns.

The problem appears to be that the nearest dhcpd server usually wins in assigning IP addresses, and the way dhcpd failover is implemented is a lot more like load-balancing than I would have thought.  In my opinion, the failover peer should be quiet unless its peer is down (which it is aware of as part of the failover protocol).  In any case, the failover server is live and ready to fill requests, and if it happens to be the nearest server to a given host, it gives out the IP address.

So it’s not enough to be able to get a host list from the failover peer if the main server is down.  At any time, some IPs may be assigned from the master host and some from the failover host.

3. Dhcpd doesn’t put hardcoded hosts into its leases file.

This is arguably not a design defect in dhcpd (I can’t imagine that it was an accident), but it sure is annoying.  When you assign a permanent IP to a host using this syntax:

host foobar { hardware ethernet de:ad:be:ef:ca:fe; fixed-address 10.1.1.15; }

then the host disappears from DNS as soon as its old lease expires.

So I managed to briefly lose taurus, my secondary dhcpd server.  Because taurus would be the backup source of internal DNS data, I decided to fix its IP address.  (dnscache can’t accept hostnames in its servers file, for the obvious reason.)  But I neglected to add an entry for taurus in the static host file, and when the old lease expired, dhcpd did not record a new one.  Even though taurus was still getting all its host configuration information from dhcp, including IP address, routing, name servers, etc., dhcpd did not consider it to have a lease since its address could not expire.  Since taurus did not have a valid lease in dhcpd.leases, the djb_update script didn’t put an entry for taurus into the dhcp file.

The most annoying thing about this one is that in order to fix the IP address of a host it’s not enough to change dhcpd.conf; you had also better remember to change the static host table.  And the penalty for error is a delayed failure, because the host won’t disappear from DNS until the old lease expires.
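One way to catch this before the old lease expires is to cross-check dhcpd.conf against the static host table.  The following is only a rough sketch in Python, not part of djb_update.pl: the function names are made up for illustration, it assumes simple one-level host blocks like the example above (no comments or nested braces inside them), and it assumes you can already produce the list of names in your static host table by some other means.

```python
import re

# Matches simple, non-nested host declarations such as:
#   host foobar { hardware ethernet de:ad:be:ef:ca:fe; fixed-address 10.1.1.15; }
# Assumption: no '}' appears inside the block and comments are not handled.
HOST_RE = re.compile(r"\bhost\s+(\S+)\s*\{([^}]*)\}")
FIXED_RE = re.compile(r"fixed-address\s+([^;\s]+)\s*;")


def fixed_hosts(conf_text):
    """Return {hostname: address} for every host block with a fixed-address."""
    hosts = {}
    for name, body in HOST_RE.findall(conf_text):
        match = FIXED_RE.search(body)
        if match:
            hosts[name] = match.group(1)
    return hosts


def missing_from_static(conf_text, static_names):
    """Names pinned in dhcpd.conf that are absent from the static host table.

    static_names is any iterable of hostnames; how it is read from the
    static host file is left to the caller (hypothetical input).
    """
    return sorted(set(fixed_hosts(conf_text)) - set(static_names))
```

Run from a Makefile or cron job, a non-empty result from missing_from_static would flag a host like taurus while its old lease is still keeping it in DNS.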

4. djb_update lets the last live lease it reads win.

This is a reasonable design decision since djb_update expects to run on one server’s dhcpd.leases file, not on the concatenation of two servers’ files.  But it means that if a host gets a lease from the failover server, the master then recovers, and we manually renew the lease and get a response from the master, there appear to be two live leases for the host.  (I’d call that a bug in the dhcp protocol, that the host can abandon the lease without notifying the failover server.)

Since djb_update doesn’t expect that to happen, it just runs through the leases file making a name -> IP hash using all the valid leases.  If it sees two valid leases for the same name, it uses whichever lease it sees last.  That means that order matters when concatenating the lease files.

A more robust solution would be to honor the most recently written active lease: when updating the lease data, add the condition that the start time of the lease under consideration must be later than the start time of the saved lease.
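Here is a minimal sketch of that rule, written in Python rather than the Perl of djb_update.pl and not taken from that script.  The function names are hypothetical, and the parsing is deliberately simplified: it assumes each lease block carries starts, ends and client-hostname lines in the usual dhcpd.leases layout, skips leases without them (including ones that end "never"), and assumes no "}" appears inside a block.

```python
import re
from datetime import datetime

# Simplified patterns for dhcpd.leases entries (assumed layout).
LEASE_RE = re.compile(r"\blease\s+(\S+)\s*\{([^}]*)\}")
STARTS_RE = re.compile(r"starts\s+\d\s+(\d{4}/\d\d/\d\d \d\d:\d\d:\d\d);")
ENDS_RE = re.compile(r"ends\s+\d\s+(\d{4}/\d\d/\d\d \d\d:\d\d:\d\d);")
NAME_RE = re.compile(r'client-hostname\s+"([^"]+)";')
# Anchored to the line start so "next binding state ..." is not matched.
STATE_RE = re.compile(r"^\s*binding state\s+(\w+);", re.M)


def parse_time(stamp):
    """Parse a lease timestamp such as '2006/03/01 12:00:00'."""
    return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")


def newest_leases(leases_text, now):
    """Return {hostname: ip}, keeping the live lease with the latest start."""
    best = {}  # hostname -> (start time, ip)
    for ip, body in LEASE_RE.findall(leases_text):
        name = NAME_RE.search(body)
        starts = STARTS_RE.search(body)
        ends = ENDS_RE.search(body)
        state = STATE_RE.search(body)
        if not (name and starts and ends):
            continue  # not enough data to map a name to this lease
        if state and state.group(1) != "active":
            continue  # released, expired or otherwise not in use
        if parse_time(ends.group(1)) <= now:
            continue  # lease has already run out
        start = parse_time(starts.group(1))
        host = name.group(1)
        # The extra condition: only replace a saved lease with a newer one.
        if host not in best or start > best[host][0]:
            best[host] = (start, ip)
    return {host: ip for host, (_start, ip) in best.items()}
```

With this comparison in place, the order in which the two servers’ lease files are concatenated no longer decides which address ends up in DNS.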

This is one of those problems where it takes more time to complain about it than to fix it.  I plan to hack this into my already-hacked-up djb_update.pl script.

Next time: the file layout, Makefiles, scripts, duct tape and baling wire necessary to make this thing work.