I have a system deployed on AWS EKS. Sometimes the metrics from the spot instances go down, and API calls to pods on those nodes become very slow. Here is my setup:
- 1 EKS cluster
- 1 on-demand node group
- 1 Karpenter v0.29.2 provisioner that provisions spot nodes (2 vCPU, 8-16 GiB: ["m5a.large", "m5.large", "m6i.large", "r5a.large", "m5d.large", "r5.large", "r6i.large", "r5n.large", "c5.xlarge", "c6i.xlarge"]) - a trimmed sketch of the provisioner is below
- my API pods running on those nodes (6-8 pods per node)
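For reference, the provisioner spec looks roughly like this. It's only a sketch: the resource names and the `providerRef` (the AWSNodeTemplate it points to) are placeholders for my actual setup, but the capacity type and instance types match what I described above.

```yaml
# Karpenter v0.29.x uses the v1alpha5 Provisioner API.
# Names below are placeholders; requirements match the setup described above.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-api            # placeholder name
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5a.large", "m5.large", "m6i.large", "r5a.large", "m5d.large",
               "r5.large", "r6i.large", "r5n.large", "c5.xlarge", "c6i.xlarge"]
  providerRef:
    name: default            # placeholder AWSNodeTemplate
```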
I found that API latency sometimes spiked to >20 s (normal is 10-20 ms). When I checked the Prometheus "up" metric, it would occasionally drop to 0 for 1-3 minutes at a time, even though traffic had not grown much. After the metric returned to 1, API performance went back to normal. During those windows, connections to Redis or Mongo also sometimes timed out or were refused.
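To be concrete about what "up got 0" means: this is roughly the alerting rule I use to catch those windows. It's only a sketch, and the `node-exporter` job label and the alert name are assumptions about my scrape config, not something specific to the problem.

```yaml
# Minimal Prometheus alerting rule sketch; job label and alert name are placeholders.
groups:
  - name: node-scrape-health
    rules:
      - alert: NodeScrapeDown
        expr: up{job="node-exporter"} == 0   # scrape target unreachable
        for: 1m                              # matches the 1-3 minute outage windows
        labels:
          severity: warning
        annotations:
          summary: "Metrics scrape on {{ $labels.instance }} has been down for 1m"
```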
Here is what I found in the node system logs: I compared the affected nodes' logs with those of nodes that were working fine, and there weren't any noticeable differences.
I wonder why the Karpenter spot nodes slow down sometimes. Has anyone faced this problem, or does anyone have an idea how to debug it? Thank you so much!
Update: I switched the Karpenter provisioner to on-demand capacity, and the same issue still occurs. In particular, it happens more frequently when I increase the number of API pods on a node, so for now I've decreased the pods per node (see the sketch below for how I'm capping that). CPU and memory utilization sit around 50-70%. Disk usage is under 600 IOPS on gp3 volumes provisioned with 3000 IOPS, so I don't think I'm hitting that limit or getting throttled.
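This is roughly how I'm capping pods per node while testing, via the kubelet configuration on the provisioner. Again just a sketch against the v1alpha5 API used by Karpenter v0.29.x; the `maxPods` value of 6 is only the number I'm currently experimenting with, not a recommendation.

```yaml
# Provisioner excerpt: limit how many pods land on each node.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-api            # placeholder name
spec:
  kubeletConfiguration:
    maxPods: 6              # experimental value; lowered from the usual 6-8 API pods per node
```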