I have an AWS EC2 Auto Scaling Group for GPU accelerated g4dn-2xlarge instances.
Recently we've had a couple of days where the ASG times out after 5 minutes scaling from 0 to 1 and the instance it spun up sticks around until we terminate it manually.
The ASG won't scale up while this instance is around and cluster-autoscaler just logs "Failed to find readiness information for" until termination. This halts our processing pipeline of course when this happens.
I've set up DataDog monitors for these log messages, but how do I troubleshoot the root cause? Is 5 minutes too small for this instance type? Where can I increase this setting?
EKS is 1.28 on cluster-autoscaler 1.28.4