Compute Utilisation: Not what you think it means

A brief guide to understanding what utilisation in compute clusters means and how this relates to your code, supercomputers and cloud providers.

There’s been a lot of chatter about utilisation rates in the cloud (and GPUs in particular) of late. I have no more information than you do on this topic 😉 But it is abundantly clear to me that a lot of folks are talking about utilisation with only the most superficial understanding of what it means.

They usually treat it the way Jeremy Clarkson treats horsepower: the higher the number, the better. As most HPC old hands will tell you, this isn’t always true. (By the way, more horsepower isn’t always better either.)

But first let’s start with some context. Utilisation means different things to different people, and context is everything. Not only that, what “good” looks like depends entirely on that context. It’s also worth stating that most people, when referring to utilisation rates, mean the usage of the CPU or GPU of a machine. Sure, that’s one measure of utilisation, but there are others too.

Let’s start small. You’re a developer testing some code on a single CPU or GPU. This is one of the few cases where aiming for 100% utilisation might be the goal. Even then, it’s only possible if the code is purely compute bound.
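
To make that concrete, one rough way to see what your code is actually doing is to sample the utilisation counters while it runs. The sketch below assumes an NVIDIA GPU with nvidia-smi on the PATH; the sampling interval and duration are arbitrary choices, and the reported “utilisation” only tells you the device was busy, not that it was busy doing useful work efficiently.

```python
# Rough sketch: sample reported GPU utilisation while a workload runs.
# Assumes an NVIDIA GPU and nvidia-smi on the PATH; interval and duration
# are arbitrary. A busy GPU is not necessarily an efficiently used GPU.
import subprocess
import time

def sample_gpu_utilisation(seconds=30, interval=1.0):
    """Poll nvidia-smi and return average compute and memory utilisation (%)."""
    compute, memory = [], []
    for _ in range(int(seconds / interval)):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,utilization.memory",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        gpu_util, mem_util = out.splitlines()[0].split(",")  # first GPU only
        compute.append(float(gpu_util))
        memory.append(float(mem_util))
        time.sleep(interval)
    return sum(compute) / len(compute), sum(memory) / len(memory)

if __name__ == "__main__":
    c, m = sample_gpu_utilisation(seconds=10)
    print(f"avg compute utilisation: {c:.0f}%   avg memory utilisation: {m:.0f}%")
```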

Remember, the code serves a purpose, and it is entirely feasible that this purpose (be that running stochastic parrots or calculating imaginary financial valuations) may be served best when some other part of the machine is at its limit, not the CPU or GPU. That might be the memory bandwidth, the network connection or even how much RAM or vRAM is used. Not every workload is perfectly compute bound, even among highly compute intensive workloads. There’s a reason we have things like InfiniBand, NVLink and APUs.
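
A back-of-the-envelope roofline check shows why. Whether a kernel can get anywhere near peak compute depends on its arithmetic intensity (FLOPs per byte moved) relative to the machine’s balance point. The peak figures below are invented placeholders rather than any real product’s spec, but the shape of the result holds: a memory-bound kernel can keep the GPU “busy” while using a tiny fraction of its arithmetic capability.

```python
# Back-of-the-envelope roofline model: is a kernel compute-bound or
# memory-bandwidth-bound? The peak figures are illustrative, not real specs.
PEAK_FLOPS = 100e12       # hypothetical peak compute: 100 TFLOP/s
PEAK_BANDWIDTH = 2e12     # hypothetical peak memory bandwidth: 2 TB/s

def attainable(flops, bytes_moved):
    """Return (fraction of peak compute attainable, which resource binds)."""
    intensity = flops / bytes_moved                # FLOPs per byte moved
    balance = PEAK_FLOPS / PEAK_BANDWIDTH          # machine balance point
    perf = min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)
    bound = "compute-bound" if intensity >= balance else "memory-bound"
    return perf / PEAK_FLOPS, bound

n = 8192
# Large matrix multiply: lots of FLOPs per byte moved.
matmul = attainable(flops=2 * n**3, bytes_moved=3 * n**2 * 4)
# Element-wise op (e.g. adding two arrays): a fraction of a FLOP per byte.
elementwise = attainable(flops=n**2, bytes_moved=3 * n**2 * 4)

print(f"matmul:      {matmul[0]:.1%} of peak compute ({matmul[1]})")
print(f"elementwise: {elementwise[0]:.1%} of peak compute ({elementwise[1]})")
```

On those made-up numbers the element-wise kernel tops out well under 1% of peak compute, no matter how long it runs.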

Great, so we have a compute intensive workload that perhaps on its own cannot hit 100% utilisation. Surely we need to run as many instances of that workload until we do hit 100%, right?

Maybe.

Assuming you even own the whole GPU, have exactly one workload to run on it, are able to run multiple instances, and some whole-number multiple of that workload lands precisely on 100% utilisation (going over will significantly impact performance), then this might be desirable. Even then it assumes that some other part of the system won’t bottleneck before your CPU or GPU does.

If you have multiple types of workload the problem gets even harder. You need to juggle resource allocation between competing demands, and the “noisy neighbour” problem means that, regardless of what you do, one of those workloads will be negatively impacted by the other even if neither is at 100% utilisation. Getting the best output overall may mean deliberately running below 100% utilisation. Granted, it won’t mean running at 15% either.
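
Here’s a deliberately crude model of what happens when you pack instances onto shared hardware (every number in it is made up): each instance wants a slice of compute and a larger slice of memory bandwidth, and once the shared bandwidth saturates, adding instances stops adding useful work.

```python
# Toy model of packing instances onto one GPU. Each instance demands a slice
# of compute and of memory bandwidth; whichever shared resource saturates
# first caps the useful work done. All figures are invented for illustration.
def packed(n, compute_per_inst=0.15, bandwidth_per_inst=0.30):
    compute_demand = n * compute_per_inst        # fraction of peak compute
    bandwidth_demand = n * bandwidth_per_inst    # fraction of peak bandwidth
    scale = min(1.0, 1.0 / max(compute_demand, bandwidth_demand))
    useful_work = n * scale                      # throughput, in "instances' worth"
    return compute_demand * scale, bandwidth_demand * scale, useful_work

print(" n   compute   bandwidth   useful work")
for n in range(1, 9):
    c, b, w = packed(n)
    print(f"{n:2d}   {c:7.0%}   {b:9.0%}   {w:11.2f}")
```

In that toy model, useful work plateaus once bandwidth hits 100%, while reported compute utilisation never gets past about 50%; chasing the latter would tell you nothing useful.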

Let’s take a step up the ladder: now you’re running a large compute cluster. You have multiple users with varying types of demand. Some of them have long duration workloads with a low sensitivity to time to result. Others have short duration, highly time sensitive jobs. It doesn’t matter whether this is end-of-day and interactive financial risk, or AI training and inference. The patterns are similar. If you allow your entire cluster to run at 100% utilisation with long running jobs, your short jobs will never run within timeframes acceptable to your users. Keeping even one CPU or GPU idle to serve that short job means you’re no longer at 100% utilisation.

That’s before we even talk about queue depth. What if running at 100% also means there’s a week-long stack of jobs before yours even gets a look in? Still want to run at 100% all the time? Will your users want you to?
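
A crude thought experiment (all of the numbers are invented) shows the trade-off: fill every GPU with long jobs and a short, time-sensitive job has to wait for something to finish; hold a little capacity back and headline utilisation drops, but the short job runs almost immediately. This ignores the backlog entirely, which only makes the 100% case look worse.

```python
# Toy trade-off: run every GPU flat out on long jobs, or reserve a few for
# short, time-sensitive work. Cluster size and job lengths are invented, and
# the reserved GPUs are optimistically assumed to be free when needed.
import random

random.seed(0)
GPUS = 16
LONG_JOB_HOURS = 24.0      # e.g. a training run or end-of-day batch

def short_job_wait(reserved):
    """Hours a short job waits when it arrives at a random moment."""
    if reserved > 0:
        return 0.0         # a reserved GPU is assumed free (a simplification)
    # No reservation: every GPU is part-way through a long job; the short job
    # waits for the earliest one to finish.
    return min(random.uniform(0, LONG_JOB_HOURS) for _ in range(GPUS))

for reserved in (0, 1, 4):
    waits = [short_job_wait(reserved) for _ in range(10_000)]
    print(f"reserved={reserved}: ~{(GPUS - reserved) / GPUS:.0%} utilisation, "
          f"mean short-job wait ≈ {sum(waits) / len(waits):.1f} h")
```

On those made-up numbers, squeezing out the last few percent of utilisation costs your interactive users over an hour of waiting for a job that might only take minutes to run.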

These are challenges that centre directors and heads of grid computing have wrestled with for decades. These are not new problems. Granted, they’re more of a concern at orders of magnitude larger scale. Don’t get me wrong, I’m not suggesting the answer is to set light to your balance sheet and stockpile silicon. Just don’t fall into the trap of thinking that 100% utilisation is the target. The real number may be lower than you’d expect too.

If your job is to provide infrastructure to other people (i.e. you give them VMs and not a scheduler endpoint to submit workload to) then almost none of the above even applies. If you’re a cloud or an internal department that provisions VMs for your organisation, all you really care about is pricing and how many you have left. The utilisation of any resources within those VMs (be that CPUs, GPUs or memory) is quite frankly none of your beeswax. (Well, maybe a little if you’re an internal service provider rather than an external cloud or MSP.)

Utilisation rates are someone else’s problem. You might worry about your stock levels and I guess you could argue that’s another form of utilisation, but it really does not mean the same thing.

If I ran a cloud and was making money hand over fist but my GPU utilisation rates were low, that would probably be a good thing: my electricity bill would be lower than if they were high!

So, I still don’t know if AWS has a 15% or 80% utilisation rate on their GPUs. I do know that it doesn’t matter as much as you think it does and the importance of that metric is rather overstated.


I often get asked about the AI-generated images that accompany some of the articles and how they are generated. Well, this one was particularly challenging, and honestly I'm not sure why I persisted as long as I did trying to get it right... but here are the prompts that eventually led to it.

ChatGPT - Picture request design
Shared via ChatGPT