Understanding Load Average on Linux
One of the things that really confused me back in the last millennium was understanding the results I would get from “top -c” or “uptime”. They showed load averages that seemed to make sense when they were low: “0.32 looks like a low load. Does that mean 32%, so I’m using about 1/3 of the server’s capacity?” It got more confusing when the number was something like 2 and the server didn’t seem busy at all. That made no sense: wouldn’t 2 mean 200%?
The load numbers on a Linux system are often confusing for people coming from Windows and other operating systems. The number often looks like a percentage, but it isn’t.
When you evaluate the system load you have to know the number of CPUs your system has. Otherwise those numbers mean nothing.
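To make that concrete, here is a minimal Python sketch that reads the three load averages and divides each by the core count. The helper name load_per_core is my own invention for illustration; os.getloadavg() and os.cpu_count() are standard library calls, and os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_per_core(load, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to 1 core in that case.
    cores = os.cpu_count() or 1
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    for label, load in zip(("1 min", "5 min", "15 min"), os.getloadavg()):
        print(f"{label}: load {load:.2f} on {cores} cores "
              f"= {load_per_core(load, cores):.2f} per core")
```

A per-core value below 1 roughly means the CPUs are keeping up; the raw number on its own tells you very little.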
To describe this in general terms, as it was taught to me, I would start with the concept that a CPU core can do 1 thing at a time. Knowing this, the number of items in the system run queue starts to make sense. The run queue holds the jobs that are either being worked on or about to be started. (On Linux the count also includes tasks in uninterruptible sleep, usually waiting on disk I/O, which is one reason a server can show a load of 2 without seeming busy.)
In a 1 CPU system any number below 1 means that the system is keeping up with its tasks: it has 1 CPU and, on average, fewer than 1 task being worked on (or about to be started). In a 2 CPU/dual core system, if there is 1 item in the run queue, one or the other CPU is working on it or just about to grab it and you won’t have any delay. If you are on a 16 core server and your load average is creeping up to 14, you are still not in bad shape because that’s less than your CPU core count and all of the jobs are being worked on or are about to be started.
Now if your numbers are double your CPU count, then you are definitely overloaded and the system is falling behind. All of the CPUs are busy and each one has another job waiting for it once it finishes its current job.
After that, you need to watch out for a snowballing effect as the system starts to spend more and more time juggling its resources and less time processing the demands of its users.
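The rules of thumb above can be written down as a small Python sketch. The function name describe_load and the three labels are made up for illustration, and the middle "queueing" band (more than the core count but less than double) is my own assumption, since the explanation above only covers the two ends.

```python
def describe_load(load, cores):
    """Classify a load average against the number of CPU cores.

    Rough rule of thumb only; the cut-offs are assumptions, not kernel rules.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if load <= cores:
        # Every job is running or about to be picked up by a free core.
        return "keeping up"
    if load < 2 * cores:
        # Some jobs are waiting, but not a full extra job per core yet.
        return "queueing"
    # Every core is busy and has at least one more job lined up behind it.
    return "overloaded"


if __name__ == "__main__":
    print(describe_load(14, 16))  # the 16 core server from the text
    print(describe_load(2, 1))    # load of 2 on a single core
```

The same load of 2 lands in a different bucket on a 1 core box than on a 4 core box, which is why the core count has to come first.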
Oh, and why are there three numbers? Those are averages for different time periods: a 1 minute average, a 5 minute average and a 15 minute average.
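On Linux the same three numbers are also exposed as the first three fields of /proc/loadavg. Here is a small Python sketch that pulls them out; the function name parse_loadavg and the sample line are made up for illustration.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("expected at least three fields")
    # The first three whitespace-separated fields are the load averages.
    one, five, fifteen = (float(value) for value in fields[:3])
    return one, five, fifteen


if __name__ == "__main__":
    # Made-up sample line; on a real Linux box you could read the file instead:
    #     with open("/proc/loadavg") as handle:
    #         text = handle.read()
    sample = "0.32 0.45 0.50 1/234 5678\n"
    print(parse_loadavg(sample))
```

Comparing the three values tells you the trend: a 1 minute number well above the 15 minute number suggests the load is rising, and the reverse suggests it is easing off.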
(This explanation can get much more complicated. For example, those 3 numbers are exponentially weighted moving averages (EMA or EWMA), meaning that in a 15 minute average, the most recent load measurements carry more weight than the older ones. So a load that was ramping up in that 15 minute period would produce a higher average than a load that was decreasing at the exact same rate over the same 15 minutes. But since I want to keep this explanation simple, I’m not going to even mention that.)
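For the curious, here is a simplified Python sketch of that weighting. It assumes one sample every 5 seconds and a decay factor of exp(-5/900) for the 15 minute figure, which mirrors how the Linux kernel is usually described as doing it, but it uses plain floating point instead of the kernel's fixed-point arithmetic. The names ewma_load and compare_ramps, and the 0-to-4 ramp, are made up for illustration.

```python
import math


def ewma_load(samples, window_seconds, interval_seconds=5.0, start=0.0):
    """Fold run-queue samples into an exponentially weighted moving average."""
    # Each step keeps `decay` of the old average and blends in the new sample.
    decay = math.exp(-interval_seconds / window_seconds)
    load = start
    for sample in samples:
        load = load * decay + sample * (1.0 - decay)
    return load


def compare_ramps(peak=4.0, steps=180, window_seconds=900):
    """Return (ramp-up average, ramp-down average) for mirrored load ramps."""
    # 180 samples at 5 second spacing cover 15 minutes.
    ramp_up = [peak * i / (steps - 1) for i in range(steps)]
    ramp_down = list(reversed(ramp_up))
    return (ewma_load(ramp_up, window_seconds),
            ewma_load(ramp_down, window_seconds))


if __name__ == "__main__":
    up, down = compare_ramps()
    print(f"ramping up:   {up:.2f}")
    print(f"ramping down: {down:.2f}")
```

Both ramps contain exactly the same samples, just in opposite order, yet the ramp-up case ends with the higher number because its busiest samples are the most recent ones.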
Hope that helps!