The issue was detected while analyzing application logs, which showed spike periods a few seconds long during which messages from multiple clients reach the server with a substantial delay (up to a couple of seconds). The application uses persistent connections, over which the clients and the server exchange short messages (much smaller than the MTU) a couple of dozen times per second (think voice data or gaming traffic).

To dig deeper, I recorded a tcpdump capture and found that random segments from multiple (but not all) clients get lost during those spikes (so the server sends out a lot of SACKs). The retransmissions arrive after about 300 ms in the best case, hence the delay at the application level while the server waits for the missing segments. For a particular affected client, it is not just one retransmission per spike but a series of retransmissions. Commands like ifconfig -a don't report any packet loss, and /var/log/syslog is clean. The link is 10 Gbit, while the incoming/outgoing traffic barely reaches 10 Mbit at peak hours.
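
For illustration, the kind of analysis described above (spotting segments that arrive late at the receiver and measuring how long the gap stayed open) could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the capture has already been exported to `(timestamp, flow, seq, payload_len)` records sorted by time, sequence numbers are relative and do not wrap, and the function name `find_late_segments` and the record layout are hypothetical, not part of any tool.

```python
from collections import defaultdict

def find_late_segments(segments):
    """Return (flow, seq, delay_seconds) for segments that fill a gap.

    segments: iterable of (timestamp, flow, seq, payload_len) tuples as seen
    on the receiving side, sorted by timestamp (hypothetical record layout).
    Assumes relative sequence numbers without wraparound.
    """
    next_expected = {}         # flow -> highest seq + len seen so far
    holes = defaultdict(dict)  # flow -> {hole_start: (hole_end, time_noticed)}
    late = []
    for ts, flow, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        expected = next_expected.get(flow)
        if expected is None or seq == expected:
            next_expected[flow] = end          # in-order data
        elif seq > expected:
            holes[flow][expected] = (seq, ts)  # gap noticed: data is missing
            next_expected[flow] = end
        elif seq in holes[flow]:
            hole_end, noticed = holes[flow].pop(seq)
            late.append((flow, seq, ts - noticed))
            if end < hole_end:
                # Only part of the hole was filled; keep tracking the rest.
                holes[flow][end] = (hole_end, noticed)
        # Anything else is a duplicate of data already seen and is ignored.
    return late

# Synthetic example: flow "A" loses the segment at seq 101 and receives it
# roughly 300 ms after the gap was first noticed; flow "B" is unaffected.
sample = [
    (0.00, "A", 1, 100),
    (0.00, "B", 1, 50),
    (0.02, "A", 201, 100),
    (0.02, "B", 51, 50),
    (0.04, "A", 301, 100),
    (0.32, "A", 101, 100),
]
for flow, seq, delay in find_late_segments(sample):
    print(f"flow {flow}: seq {seq} arrived {delay * 1000:.0f} ms late")
```

Grouping the reported delays by time window would show whether the affected flows cluster in the same few seconds, which is what the spikes suggest.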

The question is: what could cause this, which tools can help spot a potential problem, and where should I look? Could this have to do with the server provider?

  • Packet loss can happen at any device between client and server (e.g. router, firewall, load balancer) and also on the server itself. It is often connected with overload of the specific intermediary or end device, but might also be caused by bugs. To find out where the loss happens, you need to capture packets at the specific devices to see where exactly the packets get lost. Self-reported statistics on these devices about packet load and packet loss might help too. Commented Apr 5, 2023 at 10:43
  • This is far too broad; however, in my experience the most common cause of packet loss is insufficient capacity, specifically one pipe connecting to a smaller pipe.
    – Greg Askew
    Commented Apr 5, 2023 at 10:58
  • @SteffenUllrich The fact that it happens for multiple random clients at the same time suggests that it is not an issue with individual clients' devices/routers, imo...
    – tonso
    Commented Apr 5, 2023 at 11:14
  • @tonso: "not an issue with individual clients' devices/routers" - I agree. But many clients usually share at least some part of the network path. For example, clients using the same ISP will share most of the path, and several ISPs might use the same upstream. And even if they all come from different ISPs and upstreams, they will share the last part of the path through the infrastructure where the server is located. Commented Apr 5, 2023 at 11:51
  • @GregAskew OK, but how can the search be narrowed down then?
    – tonso
    Commented Apr 5, 2023 at 11:59
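
As a rough illustration of the suggestion in the comments to capture at several points and see where packets disappear, two captures can be compared segment by segment. The sketch below is only an assumption-laden example: it presumes both captures cover the same time interval, that nothing between the two points rewrites addresses or sequence numbers, and that each capture has already been reduced to `(flow, seq)` records. The function name `lost_between` is hypothetical.

```python
from collections import Counter

def lost_between(upstream, downstream):
    """Count segment copies seen at the upstream point but not downstream.

    upstream, downstream: iterables of (flow, seq) records taken from two
    capture points on the same path (hypothetical record layout).
    Copies are counted rather than de-duplicated, so a segment that was
    lost once and later retransmitted successfully still shows up.
    """
    missing = Counter(upstream) - Counter(downstream)  # keeps positive counts only
    return dict(missing)

# Synthetic example: seq 101 of flow "A" passes the upstream point twice
# (original plus retransmission) but reaches the downstream point only once.
upstream = [("A", 101), ("A", 201), ("A", 101), ("B", 1)]
downstream = [("A", 201), ("A", 101), ("B", 1)]
print(lost_between(upstream, downstream))
```

If the result is empty for a given pair of capture points during a spike, the loss presumably happens before the upstream point, and the comparison can be repeated one hop further out.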
