Tuesday, May 7, 2013

Why CPU Utilization is a Misleading Architectural Specification


Actually, this post only has a little to do with queueing theory.  But I can't help tagging it that way, just 'cause.

Once upon a time, before the Internet, before ARPANet, even before people were born who had never done homework without Google, computer systems were built.  These systems often needed to plow their way through enormous amounts of data (for that era) in a relatively short period, and they needed to be robust.  They could not break down or fall behind if, for instance, all of a sudden, there was a rush in which they had to work twice as fast for a while.

The companies that were under contract to build these systems were therefore compelled to build to a specified requirement.  This requirement often took a form something like, "Under typical conditions, the system shall not exceed 50 percent CPU utilization."  The purpose of this requirement was to ensure that if twice the load did come down the pike, the system would be able to handle it; that is, that the system could handle twice the throughput it experienced under typical conditions, if it needed to.

One might reasonably ask: if the purpose was to ensure that the system could handle twice the load, why not just write the requirement in terms of throughput, in words something like, "The system shall be able to handle twice the throughput of a typical workload"?  Well, for one thing, CPU utilization is, in many situations, easier to measure on an ongoing basis.  If you've ever run the system monitor on your computer, you know how easy it is to track how hard your CPU is working, every second of every day.  To test how much more throughput your system could handle, by contrast, you'd actually have to measure how much work your system is doing, then run a test to see if it could do twice as much without falling behind.  A requirement written in terms of CPU utilization would simply be easier to check.

For another thing, at the time these requirements were being written, CPU utilization was an effective proxy for throughput.  That is to say, in the single-core, single-unit, single-everything days, the computer could essentially be treated like a cap-screwing machine on an assembly line.  If your machine could screw caps onto jars in one second, but jars only came down the line every two seconds, then your cap-screwing machine had a utilization of 50 percent.  And, on the basis of that measurement, you knew that if there was a sudden burst of jars coming twice as fast—once per second—your machine could handle it without jars spilling all over the production room floor.
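The cap-screwing arithmetic can be written down in a few lines.  This is only a sketch of the single-server case described above; the function names are mine, invented for illustration.

```python
def utilization(service_time_s: float, interarrival_time_s: float) -> float:
    """Fraction of time a single server (one machine, one core) is busy."""
    return service_time_s / interarrival_time_s


def headroom_factor(util: float) -> float:
    """How many times the current arrival rate a single server could absorb."""
    return 1.0 / util


# One second to screw on a cap, one jar every two seconds.
u = utilization(service_time_s=1.0, interarrival_time_s=2.0)
print(u)                   # 0.5
print(headroom_factor(u))  # 2.0
```

With exactly one server, utilization and headroom are two views of the same number, which is why the old requirement worked.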

In other words, CPU utilization was quite a reasonable way to write requirements to spec out your system—once upon a time.

Since those days, computer systems have undergone significant evolution, so that we now have computers with multiple CPUs, CPUs with multiple cores, and cores with multi-threading/hyper-threading.  These developments have clouded the once tidy relationship between CPU utilization and throughput.

Without getting too deep into the technical details, let me give you a flavor of how the relationship can be obscured.  Suppose you have a machine with a single CPU, consisting of two cores.  The machine runs just one single-threaded task.  Because this task has only one thread, it can only run in one core at a time; it cannot split itself to work on both cores at the same time.

Suppose that this task is running so hard that it uses up just exactly all of the one core it is able to use.  Very clearly, if the task is suddenly required to work twice as hard, it will not be able to do so.  The core it is using is already working 100 percent of the time, and the task will fall behind.  All the while, of course, the second core is sitting there idly, with nothing to do except count the clock cycles.

But what does the CPU report as its utilization?  Why, it's 50 percent!  After all, on average, its cores are being used half the time.  The fact that one of them is being used all of the time, and the other is being used none of the time, is completely concealed by the aggregate measurement.  Things look just fine, even though the task is running at maximum throughput.
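Here is the two-core example as a small sketch.  The per-core numbers are the hypothetical ones from the scenario above (one core pegged, one idle), not measurements, and the function names are made up for illustration.

```python
def aggregate_utilization(per_core: list[float]) -> float:
    """Average utilization across all cores, as a system monitor might report it."""
    return sum(per_core) / len(per_core)


def apparent_headroom(per_core: list[float]) -> float:
    """Headroom you would infer from the aggregate figure alone."""
    return 1.0 / aggregate_utilization(per_core)


def single_thread_headroom(per_core: list[float]) -> float:
    """Headroom for a single-threaded task pinned to the busiest core."""
    return 1.0 / max(per_core)


cores = [1.0, 0.0]  # one core fully busy, the other idle

print(aggregate_utilization(cores))   # 0.5
print(apparent_headroom(cores))       # 2.0
print(single_thread_headroom(cores))  # 1.0
```

The aggregate says the system can take twice the load; the busiest core says the single-threaded task can take no more at all.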

In the meantime, while all of these developments were occurring, what was happening with the requirements?  Essentially nothing.  You might expect that at some point, people would latch onto the fact that computing advances were going to affect this once-firm relationship between CPU utilization (the thing they could easily measure) and throughput (the thing that they really wanted).

The problem is that requirements-writing is mind-numbing drudge work, and people will take any reasonable measure to minimize the numbness and the drudge.  Well, one such reasonable measure was to see what the previous system had done for its requirements.  What's more, those responsible for creating the requirements were, in many cases, not computer experts themselves, so unless the requirements were obviously wrong (which these were not), the inclination was to duplicate them.  That would explain the propagation of the old requirement down to newer systems.

At any rate, whatever the explanation, the upshot is that there is often an ever-widening gap between the requirement and the property the system is supposed to have.  There are a number of ways to address that, to incrementally improve how well CPU utilization tracks throughput.  There are tools that measure per-core utilization, for instance.  And even though hyper-threading can also obscure the relationship, it can be turned off for the purposes of a test (although this then systematically underestimates capacity).  And so on.

But all this is beside the point, which is that CPU utilization is not the actual property one cares about.  What one cares about is throughput (and, on larger time scales, scalability).  And although one does not measure maximum throughput capacity on an ongoing basis, one can measure it each time the system is reconfigured.  And one can measure what the current throughput is.  And if the typical throughput is less than half of the maximum throughput—why, that is exactly what you want to know.  It isn't rocket science (although, to be sure, it may be put in service of rocket science).
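A throughput-based check along those lines might look something like the following.  Treat it as a rough sketch under my own assumptions: `measure_throughput`, `meets_headroom`, and the factor of 2 are illustrative names and values, and a real load test would drive the actual system rather than a Python callable.

```python
import time
from typing import Callable, Iterable


def measure_throughput(process_item: Callable[[object], None],
                       items: Iterable[object]) -> float:
    """Process every item as fast as possible and return items per second."""
    start = time.perf_counter()
    count = 0
    for item in items:
        process_item(item)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed <= 0.0:
        raise ValueError("need at least one item and a measurable elapsed time")
    return count / elapsed


def meets_headroom(typical: float, maximum: float, factor: float = 2.0) -> bool:
    """True if measured maximum throughput is at least `factor` times typical."""
    return maximum >= factor * typical


# Hypothetical figures: 400 items/s on a typical day, 1000 items/s at saturation.
print(meets_headroom(typical=400.0, maximum=1000.0))  # True
print(meets_headroom(typical=600.0, maximum=1000.0))  # False
```

The maximum is measured once per configuration by saturating the system; the typical figure comes from ordinary monitoring.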

<queueingtheory>And you may also want to know that the throughput is being achieved without concomitantly high latency.  This is a consideration of increasing importance as the task's load becomes ever more unpredictable.  Yet another reason why CPU utilization can be misleading.</queueingtheory>
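To give a flavor of the latency point, here is a sketch using the textbook M/M/1 queue (a single server with random arrivals and random service times).  That model is my assumption for illustration; a real system may behave quite differently.  In it, the mean time a job spends in the system is 1 / (service rate - arrival rate).

```python
def mm1_mean_latency(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system (waiting plus service) for an M/M/1 queue.

    Rates are in jobs per second; the result is in seconds.
    """
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrivals meet or exceed capacity")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical server that can complete 100 jobs per second.
for arrivals in (50.0, 90.0, 99.0):
    latency = mm1_mean_latency(arrivals, service_rate=100.0)
    print(f"{arrivals / 100.0:.0%} utilized: {latency:.2f} s mean latency")
```

Under these assumptions, going from 50 to 99 percent utilization roughly doubles throughput but multiplies mean latency by fifty.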
