The issue was detected while analyzing application logs, which reported spike periods a few seconds long during which messages from multiple clients arrive at the server with a substantial delay (up to a couple of seconds). The application itself uses persistent connections over which clients and the server exchange short messages (much smaller than the MTU) a couple of dozen times per second (think voice data/gaming traffic).
To dig deeper, I recorded a tcpdump and found that during those spikes random segments from multiple (but not all) clients get lost, so the server sends out a lot of SACKs, and the retransmissions arrive after roughly 300 ms in the best case; hence the delay at the application level while the server waits for the missing segments. For a particular affected client it is not just one retransmission per spike but a whole series of retransmissions. Commands like ifconfig -a report no packet loss, and /var/log/syslog is clean. The link is 10 Gbit, while the combined incoming/outgoing traffic measures barely 10 Mbit at peak hours.
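Since ifconfig only exposes interface-level drop counters, I also looked at the kernel's own TCP statistics. A minimal sketch of how I read them (Linux only; it parses /proc/net/snmp, where a "Tcp:" header line is followed by a "Tcp:" value line, and RetransSegs counts segments retransmitted since boot):

```python
def tcp_counters(path="/proc/net/snmp"):
    """Return the kernel's TCP counters as a dict, e.g. {'RetransSegs': 1234, ...}."""
    header = values = None
    with open(path) as f:
        for line in f:
            if line.startswith("Tcp:"):
                if header is None:
                    header = line.split()[1:]   # field names
                else:
                    values = line.split()[1:]   # corresponding values
    return dict(zip(header, (int(v) for v in values)))

# Sampling this before and after a spike shows whether the stack itself
# is retransmitting, even though ifconfig reports zero loss.
print(tcp_counters()["RetransSegs"])
```

netstat -s prints the same counters in human-readable form, and ethtool -S on the interface shows the NIC/driver-level counters that ifconfig omits.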
The question is: what could cause this, which tools can help spot the underlying problem, and where should I look? Could this be an issue on the server provider's side?
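For reference, this is roughly the logic I used on the decoded capture to arrive at the ~300 ms figure: for one connection, group outgoing segments by starting sequence number and report the gap between the first transmission and the retransmission. The numbers below are made up for illustration; only the shape matches what I observed:

```python
def retransmission_gaps(segments):
    """segments: (timestamp_seconds, starting_seq) pairs for one connection.
    Returns (seq, delay) for every sequence number seen more than once,
    i.e. the time between the original send and its retransmission."""
    first_seen = {}
    gaps = []
    for ts, seq in segments:
        if seq in first_seen:
            gaps.append((seq, ts - first_seen[seq]))
        else:
            first_seen[seq] = ts
    return gaps

# Hypothetical sample: the segment at seq 1000 is lost and only
# retransmitted ~300 ms later, while the following segments go through.
sample = [(0.000, 1000), (0.020, 1100), (0.040, 1200), (0.312, 1000)]
print(retransmission_gaps(sample))  # [(1000, 0.312)]
```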