
I have an AWS EC2 Auto Scaling Group of GPU-accelerated g4dn.2xlarge instances.

Recently we've had a couple of days where the ASG times out after 5 minutes while scaling from 0 to 1, and the instance it spun up sticks around until we terminate it manually.

The ASG won't scale up while this instance is around, and cluster-autoscaler just logs "Failed to find readiness information for" until we terminate the instance. Of course, this halts our processing pipeline whenever it happens.

I've set up Datadog monitors for these log messages, but how do I troubleshoot the root cause? Is 5 minutes too short for this instance type? Where can I increase this setting?

EKS 1.28 with cluster-autoscaler 1.28.4.
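For context, here is a sketch of the kind of knob I've been poking at. I'm assuming the 5-minute window could be the ASG health check grace period, whose default is 300 seconds; the group name `my-gpu-asg` is a placeholder for our real ASG.

```shell
# Inspect the current grace period (HealthCheckGracePeriod is in seconds;
# a value of 300 would line up with the 5-minute timeout we're seeing).
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-gpu-asg \
  --query 'AutoScalingGroups[0].HealthCheckGracePeriod'

# Raise it to 10 minutes as an experiment.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-gpu-asg \
  --health-check-grace-period 600
```

Raising this hasn't obviously helped yet, which is why I'm asking where the 5-minute timeout actually lives.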

  • Same issue for me. I don't know if that's the root cause, but it happens only on Graviton EC2 instances. Commented May 24 at 15:55
  • @LucaTartarini it turns out there is an open bug, which an old coworker of mine filed at the same company 4 years ago lol. It's because 5 minutes is the hard-coded threshold for the component that actually handles the timeout, so when the setting is 5 minutes (or less, I'd assume) the pod times out but nothing knows to do anything with it. The default is 10, so I just went back up to 10. I would guess 6 should work too, though :shrug:
    – Shanteva
    Commented May 24 at 19:51
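
For anyone landing here later, a minimal sketch of what "going back up to 10" could look like on the cluster-autoscaler Deployment. `--max-node-provision-time` is a real cluster-autoscaler flag; whether it is the specific threshold the comment above refers to is my assumption, and `my-gpu-asg` is a placeholder node group name.

```yaml
# Fragment of the cluster-autoscaler Deployment's container spec.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.4
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=0:4:my-gpu-asg            # placeholder ASG / node group
      - --max-node-provision-time=10m     # keep above the 5-minute hard-coded threshold
```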
