I have a VPS that I manage on my own. It runs only a few Node.js projects, some Docker projects, and CrowdSec. The usual CPU load is about 20%.
Occasionally the server's CPU usage skyrockets to 100% and everything stops working. I'm unable to connect over SSH because the server just doesn't respond. The monitoring graphs in my VPS provider's control panel show the CPU going to 100% within a very short time.
When this happened the first time, I had no way to find out what was going on because I hadn't set up resource monitoring. The only thing I saw was something like this in my syslog:
Feb 14 07:18:46 v1274582 systemd[1]: [email protected]: Failed with result 'exit-code'.
Feb 14 07:18:46 v1274582 systemd[1]: Failed to start OpenVPN connection to server.
Feb 14 07:18:51 v1274582 systemd[1]: [email protected]: Scheduled restart job, restart counter is at 354411.
Feb 14 07:18:51 v1274582 systemd[1]: Stopped OpenVPN connection to server.
Feb 14 07:18:51 v1274582 systemd[1]: Starting OpenVPN connection to server...
Feb 14 07:18:51 v1274582 ovpn-server[3020525]: Options error: In [CMD-LINE]:1: Error opening configuration file: /etc/o>
Feb 14 07:18:51 v1274582 ovpn-server[3020525]: Use --help for more information.
Feb 14 07:18:51 v1274582 systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Feb 14 07:18:51 v1274582 systemd[1]: [email protected]: Failed with result 'exit-code'.
Feb 14 07:18:51 v1274582 systemd[1]: Failed to start OpenVPN connection to server.
Feb 14 07:18:56 v1274582 systemd[1]: [email protected]: Scheduled restart job, restart counter is at 354412.
Feb 14 07:18:56 v1274582 systemd[1]: Stopped OpenVPN connection to server.
Feb 14 07:18:56 v1274582 systemd[1]: Starting OpenVPN connection to server...
Feb 14 07:18:56 v1274582 ovpn-server[3020557]: Options error: In [CMD-LINE]:1: Error opening configuration file: /etc/o>
Feb 14 07:18:56 v1274582 ovpn-server[3020557]: Use --help for more information.
Feb 14 07:18:56 v1274582 systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Feb 14 07:18:56 v1274582 systemd[1]: [email protected]: Failed with result 'exit-code'.
Feb 14 07:18:56 v1274582 systemd[1]: Failed to start OpenVPN connection to server.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^>
Feb 14 14:30:13 v1274582 systemd[1]: Mounting FUSE Control File System...
Feb 14 14:30:13 v1274582 systemd[1]: Mounting Kernel Configuration File System...
Feb 14 14:30:13 v1274582 systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
Feb 14 14:30:13 v1274582 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 14 14:30:13 v1274582 systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Feb 14 14:30:13 v1274582 systemd[1]: Starting Load/Save Random Seed...
Feb 14 14:30:13 v1274582 systemd[1]: Starting Apply Kernel Variables...
As the log shows, something went wrong at around 07:18 AM, and new entries appear only after I forcibly rebooted the VPS from the control panel. To be better prepared for next time, I set up atop with a logging interval of 10 minutes.
Today it happened again, and I checked atopsar:
06:30:01  cpu  %usr %nice %sys %irq %softirq %steal %guest %wait %idle  _cpu_
06:40:01  all    30     0    8    0        0      0      0     0   161
            0    16     0    4    0        0      0      0     0    80
            1    15     0    4    0        0      0      0     0    81
06:50:01  all    30     0    7    0        0      0      0     0   162
            0    15     0    3    0        0      0      0     0    81
            1    15     0    4    0        0      0      0     0    81
07:00:01  all    31     0   10    0        0      0      0     0   159
            0    16     0    5    0        0      0      0     0    79
            1    16     0    5    0        0      0      0     0    79
07:10:01  all    30     0    7    0        0      0      0     0   163
            0    14     0    4    0        0      0      0     0    82
            1    15     0    4    0        0      0      0     0    81
14:30:13  ......................... logging restarted .........................
14:40:14  all    41     0   19    0        0      0      0     0   139
            0    20     0    9    0        0      0      0     0    70
            1    21     0   10    0        0      0      0     0    69
14:50:14  all    31     0    7    0        0      0      0     0   161
            0    15     0    3    0        0      0      0     0    82
            1    17     0    4    0        0      0      0     0    79
15:00:14  all    32     0    7    0        0      0      0     0   161
            0    16     0    4    0        0      0      0     0    81
            1    16     0    4    0        0      0      0     0    80
From what I can tell, a 10-minute interval was too coarse to catch the anomalous state. I have set it to 5 minutes now, but I doubt that will be enough, as I'm not sure how quickly the CPU goes to 100%. Logging every few seconds seems excessive.
Question: What else would you recommend to find out what is causing this issue?
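One way around the trade-off between a short interval and too much logging might be to sample cheaply every few seconds but write to disk only when CPU usage crosses a threshold. The following is a minimal sketch of that idea, assuming Linux (/proc/stat) and the procps version of ps; the threshold, the sampling interval, the log path, and all function names are hypothetical choices for illustration, not settings taken from this server.

```python
import os
import subprocess
import time

STAT_PATH = "/proc/stat"             # Linux only: first line holds aggregate CPU counters
THRESHOLD_PERCENT = 90.0             # hypothetical trigger level
SAMPLE_SECONDS = 5                   # hypothetical sampling interval
LOG_PATH = "/var/log/cpu-spike.log"  # hypothetical log location


def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def busy_percent(previous, current):
    """Return the busy share (0-100) between two counter snapshots."""
    deltas = [now - before for before, now in zip(previous, current)]
    # Counters: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already counted inside user/nice, so sum only the first eight.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)  # idle + iowait
    return 100.0 * (total - idle) / total


def read_counters():
    """Read the current aggregate CPU counters."""
    with open(STAT_PATH) as stat_file:
        return parse_cpu_line(stat_file.readline())


def top_processes(limit=15):
    """Return the header plus the top CPU consumers as reported by ps."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,pcpu,pmem,etime,args", "--sort=-pcpu"],
        capture_output=True, text=True, check=False,
    )
    return "\n".join(result.stdout.splitlines()[: limit + 1]) + "\n"


def main():
    """Sample forever; append a process snapshot whenever the threshold is crossed."""
    previous = read_counters()
    while True:
        time.sleep(SAMPLE_SECONDS)
        current = read_counters()
        busy = busy_percent(previous, current)
        previous = current
        if busy >= THRESHOLD_PERCENT:
            with open(LOG_PATH, "a") as log:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                log.write(f"{stamp} busy={busy:.1f}%\n")
                log.write(top_processes())
                log.flush()
                os.fsync(log.fileno())  # push to disk before the machine hangs

# Call main() to start sampling, for example from a small systemd service.
```

The fsync call is there because the machine ends up being reset hard, so anything still sitting in the page cache would probably be lost. Whether a snapshot can still be written once the box is already unresponsive is uncertain; the hope is only that the first few samples of the spike make it to disk.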
Thank you!
ssh, connect with a console. If you don't have a console, find another provider.