♥ Gluttony, ♥ weird packets, ♥ kernel bugs etc...
Personally, I can tackle most of the common Linux problems, but then I get punched in the face from a completely unexpected direction. Kernel issues are the hardest ones for me to debug, and when one happens, it always takes me a whole day to come to that magical conclusion…
I ♥ our admin team
For those who don’t know yet, I’m also one of the network admins at the Sinkuleho dormitory (next to ČVUT FEL in Prague, Dejvická). Our little admin team takes care of a total of 8 physical servers, plus a pack of switches, the wireless network and stuff, and on the software side of things, I manage most of the monitoring.
I ♥ NtopNG
So… I needed to upgrade our Ntop installation. Ntop is a neat piece of flow collection and monitoring software. Just pour the network data in and watch the awesomeness come out. Mainly our local<->remote traffic goes through that thing.
Since I’ve had some issues with Redis getting filled after a few days, I decided to upgrade NtopNG from stable to the current Git HEAD. The first compilation attempt was unsuccessful, but the very next day a compilation fix appeared in the repo and everything went smoothly afterwards.
Until I decided to run it.
I 😕 weird errors
Heeey that new error I never ever had, how are you? Come in, come in! Make yourself at home! Wanna some tea? So, how’s life? You’re feeling fine? That’s amazing, I’m feeling freaking betrayed by you!
^^ my feelings when ntop crashed after a few seconds of running. When I finally ran it in the foreground, this tiny error crept out:
ntopng supports only batman-adv version 2013 and 2014.
I mean, what?! I don’t even know what that is! So I went to the Internet to see if anyone had the same issue. It turned out that this piece of code had been added to ntopng just a few weeks earlier. The idea of reverting that patch was very tempting. But then again, I don’t use Batman packets and I had no idea how they got there!
Next step: tcpdump. See if I can trace the source.
I ♥ packets with mangled types
And that’s where the whole madness began. Quick peek into the network traffic suggested that Gluttony, our dearest data storage server, was to blame. I saw a plethora of unknown Ethernet types and was worried that the network adapter had died.
The dump above was taken from our monitoring port. The structure of the alien packets suggested that only the protocol field (type/length) had changed, and inside was a regular TCP connection - a continuation of the stream above. And tcpdump from Gluttony showed exactly the same thing. Funny thing is, it started happening only after a few thousand packets.
We also have several VLANs running on our network and the monitoring port is combining them using 802.1q trunk. So I found it weird that the packet type was suddenly so different.
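To make the "type/length" talk a bit more concrete, here is a minimal Python sketch of how the EtherType sits in a frame, with and without an 802.1q tag. It is not what tcpdump or ntopng actually do internally. The function name `classify_frame` and the small lookup table are my own illustration, and the frames in the usage part are synthetic, not taken from our capture.

```python
import struct

# A handful of well-known EtherType values; anything else is "unknown" here.
KNOWN_ETHERTYPES = {
    0x0800: "IPv4",
    0x0806: "ARP",
    0x86DD: "IPv6",
    0x4305: "batman-adv",  # ETH_P_BATMAN in the Linux headers
}
VLAN_TPID = 0x8100  # 802.1q tag protocol identifier


def classify_frame(frame: bytes):
    """Return (vlan_id or None, ethertype, name) for a raw Ethernet frame."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # Bytes 0-11 hold the destination and source MAC, bytes 12-13 type/length.
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    vlan_id = None
    if ethertype == VLAN_TPID:
        if len(frame) < 18:
            raise ValueError("truncated 802.1q tag")
        # The tag adds 4 bytes: TCI (priority + VLAN ID), then the real type.
        tci, ethertype = struct.unpack_from("!HH", frame, 14)
        vlan_id = tci & 0x0FFF  # lower 12 bits are the VLAN ID
    if ethertype <= 1500:
        # Values up to 1500 are an 802.3 payload length, not a type.
        name = "802.3 length"
    else:
        name = KNOWN_ETHERTYPES.get(ethertype, "unknown")
    return vlan_id, ethertype, name


if __name__ == "__main__":
    macs = bytes(12)  # zeroed MAC addresses, good enough for a demo
    plain = macs + struct.pack("!H", 0x0800) + bytes(20)
    tagged = macs + struct.pack("!HHH", VLAN_TPID, 42, 0x86DD) + bytes(40)
    for frame in (plain, tagged):
        print(classify_frame(frame))
```

If the two type bytes (or the tag in front of them) come out wrong in the capture, a parser like this happily reports some exotic protocol even though the payload behind it is perfectly ordinary TCP.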
Maybe the kernel went nuts? Improbable, but possible, so first I restarted the network interfaces. Of course it didn’t help. Then I rebooted the machine. But I still had the weird packets in my logs. Interestingly enough, the download (and network traffic) was working pretty much fine, even with this weirdness. It was the first clue.
I ♥ diversity
Luckily, I still have both Windows and Linux on my notebook, so I went to capture with that. I connected the monitoring port directly to my computer, and the weird Ethernet types disappeared. At the same time, the problem persisted on both servers.
So I tried to capture on Pride, our Linux router, and I also didn’t find the issue!
Tcpdump and libpcap versions? Nah, they’re all the same. Well, not between Windows and Ubuntu, but Pride, Envy and Gluttony all had the same versions, and Pride didn’t have the issue. That’s when it hit me.
I 😕 old kernels
Our router was the only machine that had the new 3.18 kernel, while the others were stuck with 3.10.41. Was it really that? I put it to the test.
And with the new 3.18 kernel, the issue disappeared! :-) Even ntop was able to start without any issues this time.
I haven’t managed to find the particular patch that fixed the bug yet, though I’m sure it came pretty soon after 3.10.41. What a nasty bug that is! But not the only one: I had some issues with IPv6 multicast as well, which 3.18 also fixed. Perhaps we were just lucky.
What made this bug interesting is that, after some research, I found that only the packet capture was affected, not the actual communication, which was a relief, at least partially. Still, it’s nice to have it fixed. Until the next bug, that is. ;-)