
I have a VPS that I manage myself. It runs only a few Node.js projects, some Docker projects, and CrowdSec. The usual CPU load is about 20%.

Occasionally the server's CPU usage skyrockets to 100% and everything stops working. I can't connect over SSH because it simply doesn't respond. The monitoring in my VPS provider's control panel shows the CPU usage climbing to 100% in a very short time.

The first time this happened I had no way to find out what was going on, because I hadn't set up any resource monitoring. The only thing I saw was something like this in my syslog:

Feb 14 07:18:46 v1274582 systemd[1]: [email protected]: Failed with result 'exit-code'.
Feb 14 07:18:46 v1274582 systemd[1]: Failed to start OpenVPN connection to server.
Feb 14 07:18:51 v1274582 systemd[1]: [email protected]: Scheduled restart job, restart counter is at 354411.
Feb 14 07:18:51 v1274582 systemd[1]: Stopped OpenVPN connection to server.
Feb 14 07:18:51 v1274582 systemd[1]: Starting OpenVPN connection to server...
Feb 14 07:18:51 v1274582 ovpn-server[3020525]: Options error: In [CMD-LINE]:1: Error opening configuration file: /etc/o>
Feb 14 07:18:51 v1274582 ovpn-server[3020525]: Use --help for more information.
Feb 14 07:18:51 v1274582 systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Feb 14 07:18:51 v1274582 systemd[1]: [email protected]: Failed with result 'exit-code'.
Feb 14 07:18:51 v1274582 systemd[1]: Failed to start OpenVPN connection to server.
Feb 14 07:18:56 v1274582 systemd[1]: [email protected]: Scheduled restart job, restart counter is at 354412.
Feb 14 07:18:56 v1274582 systemd[1]: Stopped OpenVPN connection to server.
Feb 14 07:18:56 v1274582 systemd[1]: Starting OpenVPN connection to server...
Feb 14 07:18:56 v1274582 ovpn-server[3020557]: Options error: In [CMD-LINE]:1: Error opening configuration file: /etc/o>
Feb 14 07:18:56 v1274582 ovpn-server[3020557]: Use --help for more information.
Feb 14 07:18:56 v1274582 systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Feb 14 07:18:56 v1274582 systemd[1]: [email protected]: Failed with result 'exit-code'.
Feb 14 07:18:56 v1274582 systemd[1]: Failed to start OpenVPN connection to server.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^>Feb 14 14:30:13 v1274582 systemd[1]: Mounting FUSE Control File System...
Feb 14 14:30:13 v1274582 systemd[1]: Mounting Kernel Configuration File System...
Feb 14 14:30:13 v1274582 systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
Feb 14 14:30:13 v1274582 systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 14 14:30:13 v1274582 systemd[1]: Condition check resulted in Platform Persistent Storage Archival being skipped.
Feb 14 14:30:13 v1274582 systemd[1]: Starting Load/Save Random Seed...
Feb 14 14:30:13 v1274582 systemd[1]: Starting Apply Kernel Variables...

As the log shows, something went wrong at around 07:18 AM, and new log entries only appear after I forcefully rebooted the VPS from the control panel (CP). To be better prepared for the next time, I set up atop with a logging interval of 10 minutes.

Today it happened again and I checked atopsar:

06:30:01  cpu  %usr %nice %sys %irq %softirq  %steal %guest  %wait %idle  _cpu_
06:40:01  all    30     0    8    0        0       0      0      0   161
            0    16     0    4    0        0       0      0      0    80
            1    15     0    4    0        0       0      0      0    81
06:50:01  all    30     0    7    0        0       0      0      0   162
            0    15     0    3    0        0       0      0      0    81
            1    15     0    4    0        0       0      0      0    81
07:00:01  all    31     0   10    0        0       0      0      0   159
            0    16     0    5    0        0       0      0      0    79
            1    16     0    5    0        0       0      0      0    79
07:10:01  all    30     0    7    0        0       0      0      0   163
            0    14     0    4    0        0       0      0      0    82
            1    15     0    4    0        0       0      0      0    81
14:30:13  ......................... logging restarted .........................
14:40:14  all    41     0   19    0        0       0      0      0   139
            0    20     0    9    0        0       0      0      0    70
            1    21     0   10    0        0       0      0      0    69
14:50:14  all    31     0    7    0        0       0      0      0   161
            0    15     0    3    0        0       0      0      0    82
            1    17     0    4    0        0       0      0      0    79
15:00:14  all    32     0    7    0        0       0      0      0   161
            0    16     0    4    0        0       0      0      0    81
            1    16     0    4    0        0       0      0      0    80

From what I can tell, a 10-minute interval was not enough to catch the anomaly. I have set it to 5 minutes now, but I doubt that will be enough, as I'm not sure how quickly the CPU goes to 100%. Logging every few seconds seems like too much.

Question: What else would you recommend to find out what is causing this issue?

Thank you!

1
  • If you are unable to log in through ssh, connect with a console. If you don't have a console - find another provider.
    – AlexD
    Commented Feb 14 at 17:29

2 Answers

2

No, it doesn't hang because it consumes 100% CPU, it consumes 100% CPU because it hangs.

^@^@^@^@^@^@^@^@^@^@^@^@^@^ represents null characters (ASCII 0). It means that the log file wasn't closed and flushed to disk properly, which in turn means that your system crashed right at 07:18:56. A system that has hit a kernel panic just sits there consuming 100% CPU in a while(1) loop. The only information about what happened and why it panicked is available on the system console. If you don't have a system console, change your provider; most likely they are the cause of the panics.
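
To check whether a log file contains such null bytes after an unclean shutdown, a minimal sketch could look like the following. The function name count_nul_bytes and the example path are assumptions for illustration, not something from the original post.

```shell
#!/bin/bash

# Print the number of NUL (ASCII 0) bytes in the file given as $1.
count_nul_bytes() {
    # tr -cd deletes every byte except NUL; wc -c counts what is left.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Example (the path is an assumption; adjust it to your distribution):
# count_nul_bytes /var/log/syslog
```

A non-zero count only says that the file was not written out cleanly at some point; the timestamp of the last entry before the nulls is what points at the time of the crash.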

You can verify whether your system has crashed by pinging it. The network subsystem can respond to ICMP ping even if the system is consuming all of its CPU and even if all disks are physically dead. If it is unable to respond, it is dead.
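
The ping check can also be left running from a second machine, so that the moment the server stops answering gets recorded. This is only a sketch: the function names, the hostname, and the log file are placeholders, and it assumes the Linux iputils version of ping, where -W is a timeout in seconds.

```shell
#!/bin/bash

# Send one ICMP echo request and wait up to 2 seconds for a reply.
# Assumes Linux iputils ping, where -W is a timeout in seconds.
probe() {
    ping -c 1 -W 2 "$1" > /dev/null 2>&1
}

# Print "alive" if the host answered the probe, "no-reply" otherwise.
check_host() {
    if probe "$1"; then
        echo "alive"
    else
        echo "no-reply"
    fi
}

# Append one timestamped status line per interval until interrupted.
# $1 = host, $2 = log file, $3 = seconds between probes (default 10)
watch_host() {
    while true; do
        echo "$(date '+%F %T') $1 $(check_host "$1")" >> "$2"
        sleep "${3:-10}"
    done
}

# Example, run from a machine other than the VPS (hostname is a placeholder):
# watch_host vps.example.com ping.log 10
```

The first "no-reply" line in the log can then be compared with the last syslog entry before the null bytes.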

2
  • I never looked at it from that angle. When I try to use the console in my CP, it doesn't let me do anything: I type the password and no characters appear at all, which is why I thought the system was so busy that I couldn't do anything. Is there anything that could help me detect what causes these panics?
    – Darkzarich
    Commented Feb 14 at 19:30
  • 1
    An overloaded system should still be able to print to the console. On a normal system, you hit Enter and the login prompt should appear. On an extremely overloaded system, it also should appear but you need to wait a few minutes. On a panicked system, you'll get no response to the keyboard. If your provider doesn't buffer the console then you won't get anything printed to the console before you connect. You'll need to keep the console open constantly or ask the provider if they save console output somewhere.
    – AlexD
    Commented Feb 14 at 20:02
2

I would run a script in the background that starts logging with atop when CPU usage goes over 80%. For example:

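#!/bin/bash

while true; do
    # Get current CPU usage. The first iteration of top does not reflect the
    # current interval, so take two samples one second apart and use the last.
    cpu_usage=$(LC_ALL=C top -bn2 -d1 | grep "Cpu(s)" | tail -n1 | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')

    # Check if CPU usage is over 80% (awk handles the decimal comparison)
    if awk -v u="$cpu_usage" 'BEGIN { exit !(u > 80) }'; then
        # Record 120 samples at a 5-second interval (10 minutes) to a new
        # raw file, so that the loop resumes checking afterwards
        atop -w "/path/to/logfile-$(date +%Y%m%d-%H%M%S)" 5 120 &
        PID=$!
        echo "CPU usage is over 80%, logging with atop (PID: $PID)"
        wait "$PID"
    fi

    # Sleep for 1 minute before checking again
    sleep 60
done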

Save this script as cpu_monitor.sh and make it executable (chmod +x cpu_monitor.sh). To run it in the background, use:

nohup ./cpu_monitor.sh > /dev/null 2>&1 &
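
Parsing the output of top depends on its version and locale. As an alternative for the CPU check, here is a rough sketch that computes the busy percentage from two readings of /proc/stat. The function names are made up for this example, and it assumes a Linux kernel that exposes the usual user, nice, system, idle, iowait, irq, softirq and steal columns.

```shell
#!/bin/bash

# Print "total idle" jiffies from the aggregate "cpu" line of /proc/stat.
# Idle time here includes iowait.
read_cpu_counters() {
    local user nice system idle iowait irq softirq steal
    read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
    echo "$((user + nice + system + idle + iowait + irq + softirq + steal)) $((idle + iowait))"
}

# Given total1 idle1 total2 idle2, print the busy percentage (integer).
cpu_busy_percent() {
    local dtotal=$(( $3 - $1 ))
    local didle=$(( $4 - $2 ))
    if [ "$dtotal" -le 0 ]; then
        echo 0
        return
    fi
    echo $(( 100 * (dtotal - didle) / dtotal ))
}

# Sample twice, $1 seconds apart (default 1), and print the busy percentage.
current_cpu_usage() {
    local first second
    first=$(read_cpu_counters)
    sleep "${1:-1}"
    second=$(read_cpu_counters)
    # Word splitting is intentional: each variable holds "total idle".
    cpu_busy_percent $first $second
}
```

The result is an integer between 0 and 100 averaged over all cores, so it can be compared with a plain [ "$(current_cpu_usage)" -gt 80 ] and needs neither bc nor awk.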
