What is the best practice for monitoring a system: should CPU alerts be based on regular CPU usage or on load average? I'm wondering which approach is used in big cloud environments.
1 Answer
Reaching 100% CPU utilisation is not what should trigger an alert; remaining at 100% CPU utilisation might be something to worry about.
Sharp spikes are usually good
Load fluctuates but your system does not reach the limit of available resources and doesn't experience continuous resource starvation.
When the CPU load spikes sometimes reach 100%, your system is correctly sized. When they never reach 100%, your system might be (somewhat) oversized.
Nothing to worry about.
A flat line is usually bad
When your CPU load remains at 100% CPU utilisation for a long time, your system does not have all the resources it needs.
You may need to scale up, or scale out more. Intervention and sending a pager alert might be appropriate.
On the other end of the spectrum, when your CPU load consistently remains at 0% CPU utilisation, either your system is terribly oversized and you might want to downsize, or something else is wrong (and missed by your monitoring). You probably don't want a pager alert after hours, but you should still follow up during business hours if that is a long-term trend.
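As a rough illustration of the spike-versus-flat-line distinction, here is a minimal Python sketch that classifies a window of CPU utilisation samples. The function name `classify_cpu`, the 95% and 1% thresholds, and the `"page"` / `"ticket"` labels are all assumptions chosen for the example, not settings from any particular monitoring product; the window length is whatever "a long time" means for your workload.

```python
from typing import Sequence


def classify_cpu(samples: Sequence[float],
                 high: float = 95.0,
                 low: float = 1.0) -> str:
    """Classify a window of CPU utilisation samples (percent, 0-100).

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not samples:
        # No samples at all is a monitoring gap, handled separately.
        return "no_data"
    if all(s >= high for s in samples):
        # Flat line at the top: sustained starvation, worth a pager alert.
        return "page"
    if all(s <= low for s in samples):
        # Flat line at the bottom: oversized or broken, follow up in hours.
        return "ticket"
    # Fluctuating load, including spikes to 100%, is considered healthy.
    return "ok"
```

With these assumed thresholds, a window such as `[20, 100, 30, 100]` is classified as `"ok"` because the spikes do not persist, whereas `[100, 99, 100, 98]` yields `"page"`.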
Great answer. The only thing I would add is to alert on what is important to your application. If response time is important, alert on that regardless of CPU utilization. In our application, high CPU utilization results in increased response times, and the response-time alert usually triggers before the CPU alerts. – Tim P, Dec 11, 2023 at 17:55
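A minimal Python sketch of the response-time alert suggested in the comment above, assuming latencies are collected in milliseconds over some window. The names `percentile` and `latency_alert`, the 95th percentile, and the 500 ms threshold are hypothetical choices for illustration; the comment does not say which statistic or limit is used in practice.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    # Nearest-rank method: smallest value with at least pct% at or below it.
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def latency_alert(latencies_ms: Sequence[float],
                  threshold_ms: float = 500.0,
                  pct: float = 95.0) -> bool:
    """Return True when the chosen latency percentile exceeds the threshold.

    The threshold and percentile are assumed values for this sketch.
    """
    if not latencies_ms:
        # No requests in the window: nothing to judge response time on.
        return False
    return percentile(latencies_ms, pct) > threshold_ms
```

This check looks only at what users experience, so it can fire whether or not CPU utilisation is the underlying cause.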