Freifunk Frankfurt:Arp-stability
From wiki.freifunk.net
Symptom
user perspective
- The connection between router and servers appears to break down.
- IP addresses are not assigned to client computers.
- The effect lasts for a couple of minutes and then resolves itself.
server perspective
- The first graph shows the total number of clients in the network, based on the alfred data. The other graphs show the number of ARP entries on the batbridge interface that fall within the IP range of each fastd server.
- For fastd1 and fastd4 there are visible cutoffs at around 80 clients. The values for fastd2 and fastd3 (kernel 3.2) seem stable.
- fastd5 does not hand out IP addresses at the moment, so its value is bogus.
- The drop-offs do not happen after a fixed interval but seem to be correlated with the client count.
- Affected Debian kernels: 3.16 and 4.1; confirmed in vanilla kernels as well (see below).
- The issue seems to be load-induced, as can be seen from the much more frequent drop-offs on fastd5 when more clients are connected.
- realtime-view on the graphs: http://sstats.ffm.freifunk.net/host.php?h=uber.ffm.freifunk.net&p=exec
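The per-server metric described above (ARP entries on the batbridge interface within one server's IP range) can be approximated by parsing the kernel's ARP table as exposed in `/proc/net/arp`. The following is a minimal sketch, not the script that actually feeds the graphs: the function name, the sample addresses and the `10.126.0.0/21` range are assumptions made for illustration only.

```python
import ipaddress


def count_arp_entries(arp_table_text, device, network):
    """Count resolved ARP entries on `device` whose IP lies inside `network`.

    `arp_table_text` is expected in the format of /proc/net/arp:
    IP address, HW type, Flags, HW address, Mask, Device.
    """
    net = ipaddress.ip_network(network)
    count = 0
    # The first line of /proc/net/arp is a column header, so skip it.
    for line in arp_table_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, flags, dev = fields[0], fields[2], fields[5]
        # Flags 0x0 marks an incomplete (unresolved) entry; ignore those.
        if dev == device and flags != "0x0" and ipaddress.ip_address(ip) in net:
            count += 1
    return count


# Hypothetical sample data; the addresses and the range are made up.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
10.126.0.17      0x1         0x2         02:00:00:aa:bb:01     *        batbridge
10.126.0.42      0x1         0x0         00:00:00:00:00:00     *        batbridge
10.126.8.5       0x1         0x2         02:00:00:aa:bb:02     *        batbridge
192.168.1.1      0x1         0x2         02:00:00:aa:bb:03     *        eth0
"""

if __name__ == "__main__":
    print(count_arp_entries(SAMPLE, "batbridge", "10.126.0.0/21"))
```

On a live server the text would come from reading `/proc/net/arp` instead of `SAMPLE`. Sampling this value more often than once a minute would make short drop-offs easier to see.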
server logs
No log entries are written around the time of the drop-offs. We do, however, see the following message on fastd1 but not on fastd2:
[422870.439430] batbridge: Multicast hash table chain limit reached: bat0
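To check whether this message lines up in time with the drop-offs, the kernel log can be scanned for it and the timestamps extracted. The sketch below assumes the log text is in the bracketed-timestamp format shown above (for example, saved `dmesg` output); the function name is hypothetical.

```python
import re

# Matches lines such as:
# [422870.439430] batbridge: Multicast hash table chain limit reached: bat0
PATTERN = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<bridge>\S+): "
    r"Multicast hash table chain limit reached: (?P<port>\S+)$"
)


def find_chain_limit_events(log_text):
    """Return (seconds_since_boot, bridge, port) for every matching line."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            events.append(
                (float(match["ts"]), match["bridge"], match["port"])
            )
    return events


if __name__ == "__main__":
    sample = (
        "[422870.439430] batbridge: "
        "Multicast hash table chain limit reached: bat0\n"
        "[422871.000001] some unrelated kernel message\n"
    )
    for ts, bridge, port in find_chain_limit_events(sample):
        print(ts, bridge, port)
```

The timestamps are seconds since boot, so they have to be converted to wall-clock time before they can be compared with the graphs.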
Analysis
things we looked at
- openvpn: the openvpn tunnels are still active when this happens
- Systems running kernel 3.2 seem fine, on both x86_64 and i686
- Kernel 4.1 is affected as well
data on the servers
fastd1
- Linux fastd1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u4 (2015-09-19) x86_64 GNU/Linux
- hosted by Hetzner
fastd2
- Linux fastd2.ffm.freifunk.net 3.2.0-4-686-pae #1 SMP Debian 3.2.65-1+deb7u2 i686 GNU/Linux
- hosted by Datafabrik
fastd3
- Linux fastd3 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
- hosted by Hetzner
fastd5
- Linux fastd5.ffm.freifunk.net 4.1.0-0.bpo.2-amd64 #1 SMP Debian 4.1.6-1~bpo8+1 (2015-09-09) x86_64 GNU/Linux
- hosted by Contabo
things we are currently looking at
- Bisecting: 89441 revisions left to test after this (roughly 17 steps) 240c3c3424366c8109babd2a0fe80855de511b35 - this clearly shows the symptom
- Bisecting: 45036 revisions left to test after this (roughly 16 steps) 4ff63e47f7b9dbd72031c364db44526b3c295591 - this shows an entirely different symptom. There are no drop-offs visible like on 3.16, 3.9 and 4.1, but the ARP table is much smaller than expected. Presumably no cutoffs are visible because of the coarse time resolution of the data (one snapshot per minute). As can be seen in the graph, load was manually shifted away from fastd2 to increase the load on fastd6, the test machine; however, the load there did not increase as expected. We are therefore seeing a symptom different from the one investigated on this page. The ARP table entries rotate very quickly, meaning that ARP caching is pretty much ineffective in this commit.
- Bisecting: 22499 revisions left to test after this (roughly 15 steps) [2b8318881ddbcb67c5e8d2178b42284749442222] Merge tag 'fbdev-for-3.8' of git://gitorious.org/linux-omap-dss2/linux
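The step counts in the bisect output above follow from each test roughly halving the number of remaining candidate revisions. The sketch below uses a plain base-2 logarithm as an approximation; it is not the exact formula git uses internally, but it reproduces the three estimates quoted above.

```python
import math


def approx_bisect_steps(revisions_left):
    """Approximate the remaining bisect steps as ceil(log2(n))."""
    if revisions_left < 1:
        return 0
    return math.ceil(math.log2(revisions_left))


if __name__ == "__main__":
    # Revision counts taken from the bisect output above.
    for n in (89441, 45036, 22499):
        print(n, approx_bisect_steps(n))
```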
git bisect of vanilla kernel
- affected:
- 19583ca584d6f574384e17fe7613dfaeadcdc4a6 (3.16)
- 240c3c3424366c8109babd2a0fe80855de511b35 (3.9)
- unaffected:
- 805a6af8dba5dfdd35ec35dc52ec0122400b2610 (3.2)
- 4ff63e47f7b9dbd72031c364db44526b3c295591 (3.6) - apparently broken but in other ways...
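Classifying a commit as affected or unaffected currently relies on looking at the graphs. As a rough sketch of how that judgement could be automated, the function below flags a series of ARP-count samples as affected when the count suddenly collapses between two consecutive samples. The thresholds (`min_drop_ratio`, `min_level`) and both function names are assumptions, not values derived from the measurements on this page. The exit codes follow the `git bisect run` convention: 0 means good, 1 means bad, 125 means the revision cannot be tested.

```python
def has_dropoff(samples, min_drop_ratio=0.5, min_level=20):
    """Return True if the count collapses between two consecutive samples.

    A drop-off is a fall of at least `min_drop_ratio` relative to the
    previous sample, starting from at least `min_level` entries.
    Both thresholds are illustrative guesses.
    """
    for prev, cur in zip(samples, samples[1:]):
        if prev >= min_level and cur <= prev * (1 - min_drop_ratio):
            return True
    return False


def bisect_exit_code(samples):
    """Map a sample series to an exit code usable with `git bisect run`."""
    if len(samples) < 2:
        return 125  # not enough data: ask git bisect to skip this revision
    return 1 if has_dropoff(samples) else 0


if __name__ == "__main__":
    print(bisect_exit_code([60, 72, 81, 12, 30]))  # collapse after 81
    print(bisect_exit_code([60, 70, 75, 78]))      # steady growth
```

A check like this would rate the 3.6 commit as good, because that commit shows a small but steady ARP table rather than drop-offs. The different breakage noted above would still have to be recognised by hand.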
clues
- commit 54951194656e4853e441266fd095f880bc0398f3 changes the arp-behavior.
References
similar symptom: [1]