Freifunk Frankfurt:Arp-stability

From wiki.freifunk.net

Symptom

user perspective

  • The connection between router and servers seems to break down.
  • IP addresses are not assigned to client computers.
  • The effect lasts for a couple of minutes and then resolves itself.

server perspective

Arp-tablesize.png

  • One graph shows the total number of clients in the network, based on the alfred data. The other graphs show the number of ARP entries on the batbridge interface that fall within the IP range of each fastd server.
  • For fastd1 and fastd4 there are visible cutoffs when reaching around 80 clients. The values for fastd2 and fastd3 (kernel 3.2) seem stable.
  • fastd5 does not hand out IP addresses at the moment, so its value is bogus.
  • The drop-offs do not happen after a fixed interval but seem to be correlated with the client count.
  • Affected Debian kernels: 3.16, 4.1. Confirmed in vanilla kernels as well, see below.
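
The per-server ARP counts described above could be collected roughly as follows. This is only a sketch, not the script that produced the graphs: it assumes the neighbour table is read with `ip -4 neigh show dev batbridge`, that entries without a resolved MAC address are not counted, and the network `10.126.0.0/20` in the usage note is a made-up placeholder for a server's IP range.

```python
import ipaddress
import subprocess


def count_arp_entries(neigh_output, network):
    """Count IPv4 neighbour entries whose address lies inside `network`.

    `neigh_output` is text in the format printed by `ip neigh show`.
    Entries without an `lladdr` field (e.g. FAILED or INCOMPLETE) are
    skipped; whether the original graphs did the same is an assumption.
    """
    net = ipaddress.ip_network(network)
    count = 0
    for line in neigh_output.splitlines():
        fields = line.split()
        if not fields or "lladdr" not in fields:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an address line
        if addr in net:
            count += 1
    return count


def read_neighbours(interface="batbridge"):
    """Return the IPv4 neighbour table of `interface` as text (needs iproute2)."""
    result = subprocess.run(
        ["ip", "-4", "neigh", "show", "dev", interface],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

On a server, `count_arp_entries(read_neighbours(), "10.126.0.0/20")` would give one data point; running it once per minute would match the snapshot interval mentioned further down this page.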

Arp-tablesize-day.png

  • The issue seems to be load-induced, as can be seen from the much more frequent drop-offs on fastd5 when more clients are connected.
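
To compare drop-off frequency against client count, the drop-offs can be picked out of the sampled ARP table sizes automatically. The following is a minimal sketch under assumptions: the function name, the `(timestamp, entries)` sample format and the 50% threshold are all illustrative choices, not something taken from the monitoring setup used here.

```python
def find_dropoffs(samples, min_drop=0.5):
    """Return (timestamp, before, after) for every sudden drop.

    `samples` is a list of (timestamp, arp_entries) tuples in time order.
    A drop-off is reported when the count falls by at least the fraction
    `min_drop` between two consecutive samples (threshold is an assumption).
    """
    drops = []
    for (_, before), (ts, after) in zip(samples, samples[1:]):
        if before > 0 and (before - after) / before >= min_drop:
            drops.append((ts, before, after))
    return drops
```

The `before` value of each reported drop is the table size just prior to the cutoff, which is the number to compare with the roughly 80 clients at which the cutoffs were observed.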

Arp-stability-load.png

server logs

No log entries are created around the time of the drop-offs. However, we are seeing the following kernel message on fastd1 but not on fastd2:

[422870.439430] batbridge: Multicast hash table chain limit reached: bat0
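
To check whether this message lines up with the drop-offs, its occurrences can be extracted from the kernel log together with their uptime stamps. The sketch below assumes the log text has the same layout as the line quoted above (for example as printed by `dmesg`); the function name is hypothetical.

```python
import re

# Matches lines like:
# [422870.439430] batbridge: Multicast hash table chain limit reached: bat0
CHAIN_LIMIT = re.compile(
    r"^\[\s*(?P<uptime>\d+\.\d+)\]\s+(?P<bridge>\S+): "
    r"Multicast hash table chain limit reached: (?P<port>\S+)$"
)


def chain_limit_events(log_text):
    """Return (uptime_seconds, bridge, port) for each matching log line."""
    events = []
    for line in log_text.splitlines():
        match = CHAIN_LIMIT.match(line.strip())
        if match:
            events.append(
                (float(match["uptime"]), match["bridge"], match["port"])
            )
    return events
```

The uptime values are seconds since boot, so they have to be converted to wall-clock time before they can be compared with the graphs.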

Analysis

things we looked at

  • openvpn: the openvpn-tunnels are still active when this happens
  • Systems running on kernel 3.2 seem fine both x64 and i686
  • kernel 4.1 is affected as well

data on the servers

fastd1

  • Linux fastd1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux
  • hosted by Hetzner

fastd2

  • Linux fastd2.ffm.freifunk.net 3.2.0-4-686-pae #1 SMP Debian 3.2.65-1+deb7u2 i686 GNU/Linux
  • hosted by Datafabrik

fastd3

  • Linux fastd3 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
  • hosted by Hetzner

fastd5

  • Linux fastd5.ffm.freifunk.net 4.1.0-0.bpo.2-amd64 #1 SMP Debian 4.1.6-1~bpo8+1 (2015-09-09) x86_64 GNU/Linux
  • hosted by Contabo


things we are currently looking at

  • Bisecting: 89441 revisions left to test after this (roughly 17 steps) 240c3c3424366c8109babd2a0fe80855de511b35 - this clearly shows the symptom
  • Bisecting: 45036 revisions left to test after this (roughly 16 steps) 4ff63e47f7b9dbd72031c364db44526b3c295591 - this shows an entirely different symptom. There are no drop-offs visible like on 3.16, 3.9 and 4.1, but the ARP table size is much smaller than expected. I guess no cutoffs are visible because of the time resolution of the data (a snapshot is taken every minute). Arp-brokenness-4ff63e47f7b9dbd72031c364db44526b3c295591.png As can be seen in the graph, I manually shifted load away from fastd2 to increase the load on fastd6, the test machine. However, the load did not increase as expected, so we are seeing a symptom different from the one we are investigating on this page. The ARP table entries rotate very quickly, meaning that ARP caching is pretty much ineffective in this commit.
  • Bisecting: 22499 revisions left to test after this (roughly 15 steps) [2b8318881ddbcb67c5e8d2178b42284749442222] Merge tag 'fbdev-for-3.8' of git://gitorious.org/linux-omap-dss2/linux
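
The "roughly N steps" figures in the bisect messages above follow from halving the remaining revisions on every test. A simple approximation is the base-2 logarithm rounded up; this is not git's exact internal formula, but it reproduces the three step counts reported on this page.

```python
import math


def estimated_bisect_steps(revisions_left):
    """Approximate remaining bisect steps as ceil(log2(revisions_left))."""
    return math.ceil(math.log2(revisions_left))
```

For the three messages above, `estimated_bisect_steps` gives 17 for 89441, 16 for 45036 and 15 for 22499 revisions.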

git bisect of vanilla kernel

  • affected:
    • 19583ca584d6f574384e17fe7613dfaeadcdc4a6 (3.16)
    • 240c3c3424366c8109babd2a0fe80855de511b35 (3.9)


  • unaffected:
    • 805a6af8dba5dfdd35ec35dc52ec0122400b2610 (3.2)
    • 4ff63e47f7b9dbd72031c364db44526b3c295591 (3.6) - apparently broken but in other ways...

clues

  • commit 54951194656e4853e441266fd095f880bc0398f3 changes the arp-behavior.

References

similar symptom: [1]