Freifunk Frankfurt:Arp-stability

From wiki.freifunk.net

Symptom

user perspective

  • The connection between router and servers seems to break down.
  • IP addresses are not assigned to client computers.
  • The effect lasts for a couple of minutes and then resolves itself.

server perspective

  • The first graph shows the total number of clients in the network, based on the alfred data. The other graphs show the number of ARP entries on the batbridge interface that fall within the IP range of each fastd server.
  • For fastd1 and fastd4 there are visible cutoffs when reaching around 80 clients. The values for fastd2 and fastd3 (kernel 3.2) seem stable.
  • fastd5 does not hand out IP addresses at the moment, so its value is bogus.
  • The drop-offs do not happen after a fixed interval but seem to be correlated with the client count.
  • Affected Debian kernels: 3.16 and 4.1. Confirmed in vanilla kernels as well (see below).

  • The issue seems to be load-induced, as can be seen from the much more frequent drop-offs on fastd5 when more clients are connected.
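
The per-server ARP counts described above could be collected roughly as follows. This is only a minimal sketch: it assumes the input is the text output of `ip neigh show dev batbridge`, and the IP ranges in `FASTD_RANGES` are hypothetical placeholders, since the real ranges are not listed on this page.

```python
import ipaddress

# Hypothetical address ranges per gateway (assumption, not the real ones).
FASTD_RANGES = {
    "fastd1": ipaddress.ip_network("10.0.0.0/22"),
    "fastd2": ipaddress.ip_network("10.0.4.0/22"),
}


def count_arp_entries(neigh_output, ranges=FASTD_RANGES):
    """Count resolved neighbour entries per gateway IP range."""
    counts = {name: 0 for name in ranges}
    for line in neigh_output.splitlines():
        fields = line.split()
        # Skip empty lines and unresolved entries (no link-layer address).
        if len(fields) < 2 or "lladdr" not in fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        for name, net in ranges.items():
            if addr in net:
                counts[name] += 1
                break
    return counts


# Made-up sample output for illustration only.
SAMPLE = """\
10.0.0.17 lladdr 02:00:00:00:00:01 REACHABLE
10.0.1.200 lladdr 02:00:00:00:00:02 STALE
10.0.4.9 lladdr 02:00:00:00:00:03 REACHABLE
10.0.5.1 FAILED
fe80::1 lladdr 02:00:00:00:00:04 STALE
"""

print(count_arp_entries(SAMPLE))
```

Taking such a count once per minute would match the snapshot interval mentioned further down, which also means that drop-offs shorter than a minute may go unnoticed.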

server logs

No log entries are created around the time of the drop-offs. We do, however, see the following on fastd1 but not on fastd2:

[422870.439430] batbridge: Multicast hash table chain limit reached: bat0
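
To compare how often this message appears on the individual servers, the kernel log can be scanned for it. The sketch below assumes the log text comes from something like `dmesg` and that every relevant line has exactly the format shown above; the function and field names are made up for illustration.

```python
import re

# Matches lines like:
# [422870.439430] batbridge: Multicast hash table chain limit reached: bat0
CHAIN_LIMIT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<bridge>\S+): "
    r"Multicast hash table chain limit reached: (?P<device>\S+)$"
)


def find_chain_limit_events(log_text):
    """Return (uptime_seconds, bridge, device) for each matching log line."""
    events = []
    for line in log_text.splitlines():
        match = CHAIN_LIMIT_RE.match(line.strip())
        if match:
            events.append(
                (float(match.group("ts")), match.group("bridge"), match.group("device"))
            )
    return events


LOG_LINE = "[422870.439430] batbridge: Multicast hash table chain limit reached: bat0"
print(find_chain_limit_events(LOG_LINE))
```

The timestamps are seconds since boot, so they have to be converted to wall-clock time before they can be lined up with the drop-offs in the graphs.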

Analysis

things we looked at

  • openvpn: the OpenVPN tunnels are still active when this happens
  • Systems running kernel 3.2 seem fine, both x64 and i686
  • Kernel 4.1 is affected as well

data on the servers

fastd1

  • Linux fastd1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux
  • hosted by Hetzner

fastd2

  • Linux fastd2.ffm.freifunk.net 3.2.0-4-686-pae #1 SMP Debian 3.2.65-1+deb7u2 i686 GNU/Linux
  • hosted by Datafabrik

fastd3

  • Linux fastd3 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
  • hosted by Hetzner

fastd5

  • Linux fastd5.ffm.freifunk.net 4.1.0-0.bpo.2-amd64 #1 SMP Debian 4.1.6-1~bpo8+1 (2015-09-09) x86_64 GNU/Linux
  • hosted by Contabo


things we are currently looking at

  • Bisecting: 89441 revisions left to test after this (roughly 17 steps) 240c3c3424366c8109babd2a0fe80855de511b35 - this commit clearly shows the symptom.
  • Bisecting: 45036 revisions left to test after this (roughly 16 steps) 4ff63e47f7b9dbd72031c364db44526b3c295591 - this commit shows an entirely different symptom. There are no drop-offs visible like on 3.16, 3.9 and 4.1, but the ARP table is much smaller than expected. I guess no cutoffs are visible because of the temporal resolution of the data (a snapshot is taken every minute). As can be seen in the graph, I manually shifted load away from fastd2 to increase the load on fastd6, the test machine; however, the load did not increase as expected. We are therefore seeing a symptom different from the one investigated on this page: the ARP table entries rotate very quickly, meaning that ARP caching is pretty much ineffective at this commit.
  • Bisecting: 22499 revisions left to test after this (roughly 15 steps) [2b8318881ddbcb67c5e8d2178b42284749442222] Merge tag 'fbdev-for-3.8' of git://gitorious.org/linux-omap-dss2/linux
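
The step counts reported by git follow from halving the remaining range at every test. As a rough approximation (not git's actual estimator, which may differ by one step in some cases), the number of steps is about the base-2 logarithm of the remaining revisions, rounded up:

```python
import math


def estimated_bisect_steps(revisions_left):
    """Approximate number of bisect steps: ceil(log2(n)). Assumption, not git's exact formula."""
    if revisions_left < 1:
        return 0
    return math.ceil(math.log2(revisions_left))


# Revision counts taken from the bisect output quoted above.
for revisions in (89441, 45036, 22499):
    print(revisions, estimated_bisect_steps(revisions))
```

For the three counts quoted above this gives 17, 16 and 15 steps, which matches what git printed. Since every step requires building and booting a kernel and then waiting for enough client load, the remaining effort is considerable.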

git bisect of vanilla kernel

  • affected:
    • 19583ca584d6f574384e17fe7613dfaeadcdc4a6 (3.16)
    • 240c3c3424366c8109babd2a0fe80855de511b35 (3.9)


  • unaffected:
    • 805a6af8dba5dfdd35ec35dc52ec0122400b2610 (3.2)
    • 4ff63e47f7b9dbd72031c364db44526b3c295591 (3.6) - apparently broken, but in a different way (ARP entries rotate very quickly; no drop-offs visible)

clues

  • Commit 54951194656e4853e441266fd095f880bc0398f3 changes the ARP behavior.

References

similar symptom: [1]