I have a system deployed on AWS EKS. Sometimes the metrics for the spot instances go down, and API calls to these nodes become very slow. Here is my system:

  • 1 EKS cluster
  • 1 on-demand node group
  • 1 Karpenter v0.29.2 provisioner to provision spot nodes (2-4 vCPU, 8-16 GiB: ["m5a.large", "m5.large", "m6i.large", "r5a.large", "m5d.large", "r5.large", "r6i.large", "r5n.large", "c5.xlarge", "c6i.xlarge"])
  • My API pods running on these nodes (6-8 pods per node).

I found that API response times rose to more than 20 s (normal is 10-20 ms). When I checked the Prometheus metric "up", it sometimes dropped to 0 for 1-3 minutes at a time, even though traffic had not grown much. Once the metric returned to 1, API performance went back to normal. During these periods, connections to Redis or Mongo sometimes timed out or were refused.
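To line up the slow requests with the scrape failures, a small script along these lines could help. This is only a sketch: it assumes the "up" samples have already been exported as (timestamp, value) pairs, and `down_windows`, `overlaps`, and the 30-second slack are hypothetical names and values chosen for illustration, not part of Prometheus.

```python
from typing import List, Optional, Tuple

# One sample of the "up" metric: (unix timestamp in seconds, value 0 or 1).
Sample = Tuple[float, int]
Window = Tuple[float, float]


def down_windows(samples: List[Sample]) -> List[Window]:
    """Return (first_down_ts, last_down_ts) for each contiguous run of up == 0."""
    windows: List[Window] = []
    start: Optional[float] = None
    last_down: Optional[float] = None
    for ts, value in sorted(samples):
        if value == 0:
            if start is None:
                start = ts  # a new outage window begins here
            last_down = ts
        elif start is not None:
            # The target is up again, so close the current window.
            windows.append((start, last_down))
            start = None
    if start is not None:
        # The series ended while the target was still down.
        windows.append((start, last_down))
    return windows


def overlaps(ts: float, windows: List[Window], slack: float = 30.0) -> bool:
    """True if ts falls inside any window, widened by `slack` seconds on each side."""
    return any(start - slack <= ts <= end + slack for start, end in windows)


if __name__ == "__main__":
    # Made-up samples at a 15 s scrape interval, purely for illustration.
    samples = [(0, 1), (15, 1), (30, 0), (45, 0), (60, 0), (75, 1), (90, 1)]
    for start, end in down_windows(samples):
        print(f"up == 0 from t={start} to t={end}")
```

Feeding the timestamps of slow API requests through `overlaps` would show whether every slow request falls inside a window where "up" was 0, or whether the two problems only partly coincide.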

enter image description here

Here is the node system log:

enter image description here

I compared the system logs of the affected nodes with those of nodes that were working fine, and there weren't any differences.

I wonder why the Karpenter spot nodes slow down from time to time. Has anyone faced this problem, or does anyone have an idea how to debug it? Thank you so much!

Update: I switched the Karpenter provisioner to on-demand instances, and the issue persists. It happens more frequently when I increase the number of API pods on a node, so I have reduced the number of pods per node for now. CPU and memory utilization are around 50-70%. Disk usage stays below 600 IOPS on a gp3 volume provisioned with 3000 IOPS, so I don't think the nodes are hitting a limit or being throttled.

  • Spot is a billing construct, not a technical construct. The only difference between standard and spot instances is that spot instances can be taken from you with a few minutes' notice. Have you compared with an on-demand node and consistently found a difference? I'd try releasing your spot instance and getting another. Very rarely, AWS hardware can have problems, but getting a new instance should move you to new hardware.
    – Tim
    Commented Sep 29, 2023 at 20:57
  • Thank you for your suggestion, but I switched the Karpenter provisioner to on-demand instances, and the issue persists. It happens more frequently when I increase the number of API pods on a node, so I have decreased the number of pods for now.
    – Tristan
    Commented Sep 30, 2023 at 6:00
