dhcpd giving wrong directions

Illustration

This is just one of many software adventures that I’ve encountered while on my wild ride at Sincoolka. Since we don’t have a blog of our own, I’m publishing it here and hope that it sticks. Actually, a colleague and a friend from Silicon Hill (a dormitory club at Strahov, Prague) prompted this text. How? Read on.

DHCP server and gateway

I assume everyone here has heard about DHCP servers. A DHCP server assigns your computers, phones, or smart condoms an IP address to use in a computer network. That’s not the only thing, though; it usually also gives you the path out of the current network, which is the default gateway’s IP address.

That gateway is basically your mailman for the outside world. You don’t need to know the full path to your destination; give the data packet to the gateway and it will take care of the rest, and it will also deliver your mail back to you. (I’m skipping asymmetric routing here, folks.)

One limitation, generally speaking, is that the gateway must be on the same network as you are. To get out of your network, there has to be someone in that network who has connections to the other networks, and to the outside world in general. And that’s your mailman.

Usually you know the mailman’s name (IP address), but not what they look like (MAC address), so you shout their name in your network until you two meet (learn each other’s MAC addresses). That’s how ARP works, very roughly. It could theoretically work outside of your subnet, but it’s not guaranteed.
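For the curious, here is roughly what that “shout” looks like on the wire. This is only a sketch of the 28-byte ARP request payload for IPv4 over Ethernet; the function name, the MAC address, and the sender address 10.20.33.44 are made up for illustration, and nothing is actually sent.

```python
import socket
import struct


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build the ARP payload asking "who has target_ip? tell sender_ip"."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # MAC address length in bytes
        4,                            # IPv4 address length in bytes
        1,                            # operation: 1 = request
        sender_mac,                   # who is shouting (MAC)
        socket.inet_aton(sender_ip),  # who is shouting (IP)
        b"\x00" * 6,                  # target MAC: unknown, that is the question
        socket.inet_aton(target_ip),  # the mailman's name
    )


# Illustrative values only.
payload = build_arp_request(bytes.fromhex("020000000001"), "10.20.33.44", "10.20.33.1")
```

On a real network this payload travels inside an Ethernet frame addressed to the broadcast MAC, which is why only hosts on your own segment normally hear it.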

dhcpd mixes up the mailman

Recently I’ve come across an interesting phenomenon: The DHCP server was assigning the correct IP address to a client, but the wrong default gateway address, which was in a completely different network.

I’m sorry, but what?

First, it was not that recent. We’d seen some instances of this before. The symptom was that the client got its address, but there was still no internet connection. After reserving a different IP address for the client, everything suddenly started working again. For some reason, we didn’t connect the dots back then and did not investigate further, even though this was pretty suspicious on its own.

One part of this could have been an IP address duplication – two MAC addresses could have been assigned the same IP address. This is a big no-no, yet due to an imperfect system, it used to happen. We fixed that, but still had the issues.

Is it Wi-Fi? Is it something else?

We went into the new semester with some new dhcpd configuration. I had been trying to fix some long-standing issues with our dhcpd configuration, which was throwing warnings all around. We were also testing new Wi-Fi 6 access points, which were giving us separate issues with Apple iOS devices not being able to connect. So whatever it was at the time, we had to be creative.

Luckily for us, the new controller for our Ruckus APs included a troubleshooting tool. With that, we were often able to isolate the root cause of the issues. It kinda sucked that it was mostly the radios :( and people forgetting their passwords… :)

A small number of people had a different issue, though: the phone connected but did not get internet (over IPv4, but we didn’t know that yet). We could not wrap our heads around that one, but a change of IP usually fixed it.

Some months later, for the first time, someone checked further and we were able to see that the iPhone got its IP address, but the gateway was completely wrong - let’s say that the device got an address of 10.20.33.44/24, but the default gateway (router) was 147.32.110.1. WTF? How?

Packet captures confirmed, for a minority of cases, that the gateway was assigned wrongly by the DHCP server. But why?
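A check like the following would have flagged such an offer right away. It is a minimal sketch using Python’s standard ipaddress module; the function name is mine, and 10.20.33.1 is only an assumed “sane” gateway for comparison.

```python
import ipaddress


def gateway_is_on_link(client_cidr: str, gateway: str) -> bool:
    """Return True if the gateway lies inside the client's own subnet."""
    network = ipaddress.ip_interface(client_cidr).network
    return ipaddress.ip_address(gateway) in network


# The broken offer from the text: the gateway sits in a different network.
print(gateway_is_on_link("10.20.33.44/24", "147.32.110.1"))  # False
# An assumed correct gateway inside 10.20.33.0/24.
print(gateway_is_on_link("10.20.33.44/24", "10.20.33.1"))    # True
```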

Hunt for the packets

So, once I knew what the problem was, I began tracing. Even though I could not find anything wrong with the DHCP server itself, I made some changes. Specifically, I tried to confine the static host {} definitions (address reservations) to their subnets, or to put whole parts of our network into groups. But that did not help: the gateway assignment was still wrong sometimes.

Pretty soon, I decided to try replacing the DHCP server in our network. If it failed, we could always go back; dhcpd worked, just not 100% of the time.

Intermission - Was this an isolated incident?

I recently talked about DHCP with a colleague from Silicon Hill, and in the discussion, it transpired that they had had big issues with memory leaks in their dhcpd. They patched it for their own purposes, though. Now it’s not leaking memory, but it’s still leaking some DHCP options, meaning that the options defined for one network are spilling over to other networks. Apparently this was “good enough” at the time, but it is suspiciously similar to what I’ve experienced. After all, the default gateway (router) is “just another DHCP option”.

Kea DHCP server

Kea is quite a new beast with two distinct variants: DHCP(v4) and DHCPv6. I chose it for its clear, hierarchical configuration in JSON, but also because it allows MySQL and PostgreSQL for reservation and lease databases. (Not so helpful for us, as it eventually turned out.)

Interestingly, Kea was not so forgiving about some of our configuration. It turned out that if the same IP address was reserved for multiple MAC addresses, Kea wouldn’t let us start the daemon. That was a change, since dhcpd just chugged along. The cause of the duplicates was an error in the database design of our IS that we have since fixed.
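Duplicates like these are easy to catch before the configuration ever reaches the daemon. Here is a minimal sketch, assuming the reservations are available as (MAC, IP) pairs; the function name and the sample data are made up, and this is not how our IS actually does it.

```python
from collections import defaultdict


def find_duplicate_ips(reservations):
    """Return {ip: [macs]} for every IP reserved for more than one MAC."""
    macs_by_ip = defaultdict(set)
    for mac, ip in reservations:
        # Normalise case so the same MAC written two ways counts once.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical reservations: two devices claim 10.20.33.44.
sample = [
    ("02:00:00:00:00:01", "10.20.33.44"),
    ("02:00:00:00:00:02", "10.20.33.44"),
    ("02:00:00:00:00:03", "10.20.33.45"),
]
duplicates = find_duplicate_ips(sample)
```

Running a check like this in the generation script means a bad record can be reported instead of leaving the DHCP server unable to start.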

For the time being, we chose to have all the IP reservations in the main configuration file, assigned to specific networks. We then proceeded to restart the daemon every 15 minutes to load the new configuration. And after some testing, it seemed to work well, so we replaced dhcpd and watched for issues.

Duplicated DHCP responses

One slight issue that we came across: On interfaces with multiple IP addresses, Kea would listen on each of them and respond to every DHCP request on that network several times – precisely, for each IP address on the interface, there was one response.

That needed to be tamed by only binding to the specific IP and not the whole interface.
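As far as I understand the Kea documentation, an entry in the interfaces list can be written as "interface/address" to bind to one address only. The sketch below just generates that fragment of the Dhcp4 configuration; the interface name eth0 and the address 10.20.33.1 are placeholders, so check the syntax against the documentation for your Kea version.

```python
import json


def interfaces_config(bindings):
    """Build a Dhcp4 fragment that binds to specific addresses only.

    bindings is a list of (interface name, IPv4 address) pairs.
    """
    return {
        "Dhcp4": {
            "interfaces-config": {
                # "name/address" instead of just "name" (assumed Kea syntax).
                "interfaces": [f"{ifname}/{addr}" for ifname, addr in bindings]
            }
        }
    }


# Placeholder interface and address.
print(json.dumps(interfaces_config([("eth0", "10.20.33.1")]), indent=2))
```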

No IPv4 address? Hmm…

Now it worked quite well, but after a week or two, there was something else. Some of our tenants wouldn’t get an IPv4 address, and only after reconnecting would they sometimes get one. This made no sense, until I tried to restart the Kea daemon manually.

It just froze while loading the leases. Upon inspecting the leases file in /var/lib/kea/, I found that it had already amassed several MBs of CSV data, all of which was loaded on each launch. After I deleted the file, the problem went away.

Obviously I had read the documentation before putting this into production. It said that the CSV file was supposed to be cleaned up automatically, with an interval of 3600 s (1 hour) between cleanups. However, because we restarted the daemon every 15 minutes, the cleanup never occurred. Thus, the lease file reached gigantic proportions, and loading it delayed the launch by a few minutes.

So, once that was known, I changed the lease cleanup interval to 10 minutes, and later changed the DHCP generation script to restart the daemon only if a change was made to the IP / MAC reservations. This fixed the issue reliably.
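The “restart only on change” part can be sketched like this. It is not our actual generation script, just an illustration of the idea: the function name is hypothetical, and the restart itself is left to the caller.

```python
from pathlib import Path


def write_if_changed(path: Path, new_text: str) -> bool:
    """Write new_text to path only if it differs; return True if written."""
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False  # nothing changed, so there is no reason to restart
    # Write to a temporary file first, then swap it into place, so the
    # daemon never sees a half-written configuration.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(new_text, encoding="utf-8")
    tmp.replace(path)
    return True
```

The calling script would restart the daemon only when write_if_changed() returns True, so a quiet quarter of an hour no longer interrupts the lease file cleanup.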

Because the outage lasted only a few minutes, our monitoring never alerted us that something was wrong. But our clients did notice. Eventually, I figured out that the monitoring did test and log the DHCP failures, so I made the reporting more aggressive.

… and they lived happily ever after!

Title image: https://commons.wikimedia.org/wiki/File:Californiaofframpwrongwaysignage.jpg
The original uploader was Coolcaesar at English Wikipedia, CC BY-SA 3.0 http://creativecommons.org/licenses/by-sa/3.0/, via Wikimedia Commons