High Performance Computing on the Cloud

Running HPC clusters is much easier on the cloud, where there’s unlimited capacity and no scheduling problems, right? That’s certainly what all the cloud vendors always tell me, but how true is it?

HPC workloads are often seen as an easy target for migration to cloud. Their large compute requirements, coupled with what at first glance often appear to be well-defined, well-isolated binaries performing discrete, stateless computation, make them one of the first candidates for migration, all in the hope of reducing the on-premises footprint and costs, or of accessing additional compute resources for burst capacity.

Whilst the above is somewhat true, and a like-for-like migration (i.e. one that results in the workload being scheduled in the same manner as it is on-premises) is certainly feasible, we are a long way from optimising HPC on cloud to the level of optimisation that has been achieved for on-premises workloads. That is before we even consider the myriad complexities involved in moving from a private corporate network to a public cloud, and all the security concerns and internal politics that inevitably must be dealt with.

The On-Prem Story

Decades of optimisation and the use of bespoke scheduling solutions such as those provided by IBM Spectrum Symphony or TIBCO GridServer (other products, including FOSS, are available) have resulted in the ability to finely control, prioritise and allocate workloads across a finite and well-defined on-premises compute capacity.

Typically, the total available compute capacity is fixed, defined by the total number of servers available in the data centre. Whilst this number may vary over time as new machines are bought, old machines are decommissioned, and borrowing agreements with owners of servers in other parts of the same company are put in place or removed, it is relatively static. It is certainly not something that varies from day to day, nor from one workload to the next.

In more complex (but rather common) scenarios, this compute may be subdivided into multiple categories: for example, a group with higher memory per core, a group with GPUs, or a group with faster networking. This provides the ability to limit certain workloads to a subset of the total available hardware if required and, more importantly, to allocate limited resources in the most effective manner possible. For example, first order financial risks may require only 2GB of RAM per core, whilst portfolio risk types such as RWA or CVA may require a much larger memory footprint of perhaps 8 or even 16GB per core. Allocating first order risk calculations to a machine with 16GB per core is clearly wasteful (when RWA calculations also need to be run). Likewise, it would make no sense to run a CPU-only calculation on a machine with CUDA GPUs available (assuming, of course, there is a corresponding CUDA workload that can make use of the resource). Of course, if the 16GB/core or GPU hardware is sitting idle, it makes sense to use it for whatever else is available.
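
To make this concrete, here is a minimal sketch (in Python, with entirely made-up pool sizes and job requirements) of the kind of placement logic described above: each job goes to the least capable pool that still satisfies its requirements, so high-memory and GPU hardware is not wasted on work that does not need it, while remaining usable if it would otherwise sit idle.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    mem_per_core_gb: int
    has_gpu: bool
    free_cores: int

@dataclass
class Job:
    name: str
    mem_per_core_gb: int  # memory required per core
    needs_gpu: bool
    cores: int

def place(job, pools):
    """Pick the least capable pool that still satisfies the job, so that
    high-memory and GPU pools are only used when needed or otherwise idle."""
    candidates = [p for p in pools
                  if p.free_cores >= job.cores
                  and p.mem_per_core_gb >= job.mem_per_core_gb
                  and (p.has_gpu or not job.needs_gpu)]
    if not candidates:
        return None
    # Prefer the "smallest" adequate pool: CPU-only before GPU, then lowest memory per core.
    best = min(candidates, key=lambda p: (p.has_gpu, p.mem_per_core_gb))
    best.free_cores -= job.cores
    return best.name

pools = [
    Pool("standard", mem_per_core_gb=2,  has_gpu=False, free_cores=1000),
    Pool("himem",    mem_per_core_gb=16, has_gpu=False, free_cores=200),
    Pool("gpu",      mem_per_core_gb=8,  has_gpu=True,  free_cores=64),
]

print(place(Job("delta-vega", 2, False, 500), pools))     # -> standard
print(place(Job("rwa-batch", 16, False, 100), pools))     # -> himem
print(place(Job("cuda-montecarlo", 4, True, 32), pools))  # -> gpu
```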

The demand for compute capacity is managed by assigning each of the compute grid clients a portion of the total compute capacity (often termed the reserved or guaranteed capacity). Clients are able to assign a priority to their workloads, providing further granularity in the execution order and the assignment of resources.

This results in a fairly simple and, importantly, predictable grid scheduler that is responsible for balancing demand from clients (each with its own guarantee and priority) against a fixed amount of compute resource (split into multiple pools).
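
As a rough illustration of that balancing act, the sketch below (again Python, with invented client names, guarantees and priorities) allocates a fixed pool of cores by honouring each client’s guaranteed share first and then handing any spare capacity to the highest-priority unmet demand. Real schedulers such as Symphony or GridServer are, of course, far more sophisticated than this.

```python
from dataclasses import dataclass

TOTAL_CORES = 1000  # the fixed on-premises capacity

@dataclass
class Client:
    name: str
    guaranteed_cores: int  # reserved share of the grid
    demand: list           # list of (priority, cores requested)

def schedule(clients, total=TOTAL_CORES):
    """Honour each client's guarantee first, then hand whatever is left
    to the highest-priority unmet demand across all clients."""
    allocation = {c.name: 0 for c in clients}
    remaining = total
    # Pass 1: guarantees.
    for c in clients:
        wanted = sum(cores for _, cores in c.demand)
        take = min(wanted, c.guaranteed_cores, remaining)
        allocation[c.name] += take
        remaining -= take
    # Pass 2: spill-over, highest priority first.
    backlog = []
    for c in clients:
        unmet = sum(cores for _, cores in c.demand) - allocation[c.name]
        if unmet > 0:
            backlog.append((max(p for p, _ in c.demand), c.name, unmet))
    for _, name, unmet in sorted(backlog, reverse=True):
        take = min(unmet, remaining)
        allocation[name] += take
        remaining -= take
    return allocation

clients = [
    Client("equity-desk", 400, [(10, 600)]),
    Client("rates-desk",  400, [(5, 300)]),
    Client("risk-office", 200, [(8, 400)]),
]
print(schedule(clients))  # -> {'equity-desk': 500, 'rates-desk': 300, 'risk-office': 200}
```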

A typical on-premises grid scheduler, then, is dealing with the following dimensions. A (relatively) fixed total compute capacity, potentially split into:

  • Multiple pools or groups of capacity with varying resource allocation
  • At a fixed cost
  • At a fixed hardware configuration
  • In a fixed location (or locations)

Multiple (but fairly static) clients/users, each of which has:

  • A variable but largely predictable demand
  • Varying priorities of workload

i.e. we have somewhere between two and five dimensions for our HPC workload scheduler to optimise across. The big question is, does cloud make this simpler? Cloud vendors would certainly have you believe it does. After all, with unlimited compute on tap, scheduling is a non-problem, right?

Reach for the clouds

As workloads are migrated to cloud, the demand side of the equation still holds true. The supply side, however, changes drastically. There is no longer a fixed quantity of compute resource to allocate against.

If you’re already engaged with a cloud vendor, chances are that they will be encouraging you to use cloud-native (read vendor-native) solutions to distribute your workload. The claim will be that there is no longer a limit on the supply side of the scheduler’s equation: grid clients can simply and indiscriminately consume the capacity they require on demand. The cost savings on grid scheduler licences are certainly tempting too.

If (and this is a big if) this genuinely holds true for your use case, a migration to cloud (provided you manage to do it all in one go and cloud costs are not a factor) results in a significantly simplified workload scheduling problem, potentially with only one dimension: demand. If that is true in your case then, firstly, great for you! Secondly, you can probably stop reading now. This, of course, also assumes that your entire workload is of the same priority, or that you are able to feed the workload in priority order, or that you’re one of the few exceptions with a perfectly horizontally scalable system that can parallelise every single job.

If you’re not so fortunate, though; if, for example, you can’t migrate your entire workload from on-premises to cloud in one fell swoop and need to balance across the two for an extended period; if any part of your workload relies on a very specific type of resource that is in limited supply (GPUs, for example); or if you need to care about optimising the cost of your compute on an ongoing basis, then I’m sorry to inform you that your workload scheduling problem just became a lot more complicated. If you’re attempting to use multiple cloud vendors, it’s worse still.
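
To illustrate just one of those complications, here is a hedged sketch (Python, with placeholder prices and capacities) of the hybrid placement decision: fill the already-paid-for on-premises cores first, burst to cloud otherwise, and treat scarce resources such as GPUs as a special case, keeping track of the marginal cost of each choice.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    needs_gpu: bool

# Illustrative figures only: a sunk-cost on-prem pool plus pay-per-use cloud capacity.
ONPREM_FREE_CORES = 800
ONPREM_GPU_CORES = 0            # no free on-prem GPU capacity left today
CLOUD_CORE_HOUR_USD = 0.045     # assumed blended price for a cloud core-hour
CLOUD_GPU_CORE_HOUR_USD = 0.90  # assumed price where GPU capacity is even available

def place(jobs):
    """Prefer already-paid-for on-prem cores; burst to cloud otherwise,
    recording the marginal cost of each placement decision."""
    onprem_free = ONPREM_FREE_CORES
    decisions = []
    for job in jobs:
        if job.needs_gpu:
            if ONPREM_GPU_CORES >= job.cores:
                decisions.append((job.name, "on-prem-gpu", 0.0))
            else:
                # Scarce resource: no on-prem GPUs left, so burst (if quota allows).
                decisions.append((job.name, "cloud-gpu", job.cores * CLOUD_GPU_CORE_HOUR_USD))
        elif job.cores <= onprem_free:
            onprem_free -= job.cores
            decisions.append((job.name, "on-prem", 0.0))  # marginal cost treated as sunk
        else:
            decisions.append((job.name, "cloud", job.cores * CLOUD_CORE_HOUR_USD))
    return decisions

for decision in place([Job("eod-risk", 600, False),
                       Job("xva-overnight", 400, False),
                       Job("cuda-scenarios", 64, True)]):
    print(decision)
```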

All CPUs are equal, but some are more equal than others

In an attempt to reduce complexity, let’s for now consider only those machine types that are marketed by the vendors as general purpose, compute optimised or HPC specific (i.e. without looking at VM types targeted at memory-intensive workloads or those with specific accelerators, for example).

AWS alone, at the time of writing, offers 21 different families of VM and over 300 different types in all. Microsoft Azure has a bewildering array of categories and subcategories, but also over 300 SKUs. GCP has a more comprehensible 7 families but still provides around 125 types, with the added bonus of being able to define your own VM type beyond these should you wish to. Again, this is only looking at general purpose and compute optimised VM types.

Which of these is optimal from a price-performance perspective for your workload? And what happens if the pricing changes on any one of those VM SKUs? Answering this requires knowledge not only of the price of each VM SKU but also of their relative performance for your particular workload. Whilst the headline specification of a virtual machine generally lists the number of vCPUs (and the memory) the machine has, no two CPU models are equal (and this article will deliberately ignore SMT for now; that will be discussed in greater detail in its own right).

In order to compare the costs, we need to compare the performance of one VM SKU relative to another too. Ideally, we’d compare the cost of some standardised CPU per hour. Of course, no such metric exists; in fact, there is no directly comparable benchmark or metric to provide your scheduler as an input. Neither AWS nor GCP provides any comparative metric of the performance of their various machine types (though both have in the past). Azure does provide some indication with the ACU (Azure Compute Unit) for its VMs, but this is expressed as a range, somewhat understandably, as each workload is different.

Which ties in neatly to the next point. Let’s pretend for the moment that not only do we have such a metric, but that its performance normalisation corresponds well to your particular use case; put another way, that whatever benchmark was used to compare performance represents a workload that correlates well with your own. A good scheduler should now be able to allocate workload cost-effectively across the various VM types.
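
Assuming, purely for illustration, that such a workload-representative performance score existed, the comparison would reduce to a price per “normalised CPU hour”, something like the sketch below (the SKU names, prices and scores are all invented).

```python
# A minimal sketch, assuming we have (a) an hourly price per SKU and (b) a benchmark
# score per vCPU that actually correlates with our workload -- neither of which the
# vendors hand us directly. All names and numbers are purely illustrative.

skus = {
    #  name          (vCPUs, $/hour, relative performance per vCPU)
    "general-8":    (8,  0.34, 1.00),
    "compute-16":   (16, 0.61, 1.15),
    "hpc-64":       (64, 2.88, 1.30),
}

def cost_per_normalised_cpu_hour(vcpus, price, perf_per_vcpu):
    """Price per hour of one 'normalised CPU' worth of work."""
    return price / (vcpus * perf_per_vcpu)

for name, spec in sorted(skus.items(), key=lambda kv: cost_per_normalised_cpu_hour(*kv[1])):
    print(f"{name:12s}  ${cost_per_normalised_cpu_hour(*spec):.4f} per normalised CPU-hour")
```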

Around the world in 80 data centres

However, the pricing of VM SKUs (within a single cloud provider) is not homogeneous across regions. Nor is there any guarantee that the performance of a given VM SKU is homogeneous across regions, or, for that matter, that it is consistent through time: welcome to shared VMs.

AWS offers 27 regions, Azure has almost 80, and GCP provides a choice of 37. For most purposes this choice will be naturally limited by legal and regulatory requirements, but even with these limitations most customers will have a choice of at least three to five different regions. That’s a three-to-five multiplier on the number of possible VM types (approximately; not all types will be available in all regions).
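
The effect on the scheduler’s search space is easy to sketch: the candidate list becomes the cross product of SKUs and regions, filtered by availability, with region-dependent pricing. (All figures below are placeholders; real prices and availability vary by vendor and change over time.)

```python
from itertools import product

skus = ["general-8", "compute-16", "hpc-64"]
base_price = {"general-8": 0.34, "compute-16": 0.61, "hpc-64": 2.88}
regions = {
    "eu-west":    {"multiplier": 1.00, "available": {"general-8", "compute-16", "hpc-64"}},
    "eu-central": {"multiplier": 1.08, "available": {"general-8", "compute-16"}},
    "us-east":    {"multiplier": 0.93, "available": {"general-8", "compute-16", "hpc-64"}},
}

# The scheduler's candidate space is now (SKU x region), minus unavailable combinations.
candidates = [
    (sku, region, round(base_price[sku] * cfg["multiplier"], 4))
    for sku, (region, cfg) in product(skus, regions.items())
    if sku in cfg["available"]
]
for candidate in sorted(candidates, key=lambda c: c[2]):
    print(candidate)
```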

Will that be cheque, card or cash sir?

For any given combination of VM SKU and region, you then need to select how to pay for it, with a choice of spot, on-demand or reserved instances (for anywhere between one and three years). The correct answer is probably something even a good scheduler is unable to determine at runtime, as it depends on factors the scheduler has no awareness of, such as the service level agreements for the output and the total compute requirement.
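
A rough sense of why this sits outside the scheduler’s remit: the effective cost of each payment model depends on expected annual utilisation, interruption tolerance and commitment length, as the back-of-the-envelope sketch below shows (the discounts are assumed figures, not vendor quotes).

```python
# Compare payment options for one hypothetical SKU in terms of cost per useful hour.
ON_DEMAND = 0.61             # $/hour, assumed
SPOT_DISCOUNT = 0.65         # spot assumed ~65% cheaper, but interruptible
RESERVED_1Y_DISCOUNT = 0.40  # assumed discount for a one-year commitment
HOURS_PER_YEAR = 8760

def effective_hourly(expected_hours_per_year):
    """Cost per hour actually used, under each payment model."""
    reserved_total = ON_DEMAND * (1 - RESERVED_1Y_DISCOUNT) * HOURS_PER_YEAR
    return {
        "on-demand":   ON_DEMAND,
        "spot":        ON_DEMAND * (1 - SPOT_DISCOUNT),           # ignores re-run cost on interruption
        "reserved-1y": reserved_total / expected_hours_per_year,  # paid whether used or not
    }

for hours in (2000, 6000, 8760):
    print(hours, {k: round(v, 3) for k, v in effective_hourly(hours).items()})
# Reserved only beats on-demand at high utilisation; spot wins if interruptions can be tolerated.
```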

Would you like fries with that?

You may be thinking we’re coming to the end of the cost-based story, but so far we have touched on only one side of it, the supply side. There is also the demand-side cost to consider. In an on-premises world, with fixed costs and limited resource availability, HPC workloads are decided by the business in a relatively static fashion. For example, an equity derivatives trading desk will decide which risk measures they wish to see; this will be augmented by the requirements of the risk management office and the regulators. There will be changes to this over time, but the workload from one day to the next is relatively constant.

Speak to any trading desk or risk manager, though, and, were endless compute to present itself, there are many additional risk metrics or scenarios they would wish to run, provided they can be run at or below a specified cost. In other words, in addition to a core workload that must be executed and has a low sensitivity to price, there is a potentially much larger, highly cost-sensitive (but less time-critical) workload that a good scheduler should be able to manage.
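
One simple way to picture this is as a budget-capped admission policy: the mandatory work always runs, while optional scenarios are admitted greedily by value per dollar until a daily cloud budget is exhausted. The sketch below uses invented jobs, values and budget figures.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    est_cost_usd: float   # estimated cloud cost to run
    mandatory: bool       # core risk run vs. nice-to-have scenario
    value: float = 0.0    # desk-assigned usefulness score for optional work

DAILY_BUDGET_USD = 5000.0  # illustrative cap for the optional, cost-sensitive work

def admit(requests, budget=DAILY_BUDGET_USD):
    """Always run mandatory work; admit optional work greedily by value per dollar
    until the budget is exhausted."""
    accepted = [r.name for r in requests if r.mandatory]
    optional = sorted((r for r in requests if not r.mandatory),
                      key=lambda r: r.value / r.est_cost_usd, reverse=True)
    spent = 0.0
    for r in optional:
        if spent + r.est_cost_usd <= budget:
            accepted.append(r.name)
            spent += r.est_cost_usd
    return accepted

print(admit([
    Request("eod-var", 12000, mandatory=True),
    Request("extra-stress-scenarios", 3000, mandatory=False, value=8),
    Request("what-if-curve-shifts", 2500, mandatory=False, value=5),
    Request("full-reval-grid", 4000, mandatory=False, value=6),
]))  # -> ['eod-var', 'extra-stress-scenarios']
```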

Cloudy with a chance of containers

VMs are no longer the only way to access compute on cloud. They may not even be the cheapest. Most HPC systems could equally well exploit containerised services (whether based on Kubernetes or other container platforms) or serverless compute offerings such as AWS Lambda, Azure Functions or GCP Cloud Functions. There are also other cloud-native features such as AWS, Azure and GCP Batch, and other HPC-on-cloud offerings. These again need to be costed in a performance-normalised manner so that the grid scheduler can allocate workload to them as appropriate. I don’t intend to delve into this in any depth here; it’s a subject worthy of an article of its own, but it is worth mentioning to highlight the additional degrees of freedom a cloud (native) grid scheduler needs to cope with.
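
Purely to illustrate the extra degree of freedom, here is a sketch of what performance-normalised costing across execution backends might look like; the prices, performance factors and start-up overheads are placeholders, not vendor figures.

```python
backends = {
    #  name        ($/core-hour, relative performance, start-up overhead in seconds)
    "vm":         (0.040, 1.00, 90),
    "container":  (0.045, 0.97, 15),
    "function":   (0.085, 0.90, 1),
}

def cost_per_job(backend, job_core_seconds):
    """Approximate cost of a job, accounting for relative performance and start-up overhead."""
    price, perf, overhead = backends[backend]
    effective_seconds = job_core_seconds / perf + overhead
    return price * effective_seconds / 3600

for job_len in (30, 600, 7200):  # short, medium and long jobs (core-seconds of work)
    best = min(backends, key=lambda b: cost_per_job(b, job_len))
    print(f"{job_len:>5}s job -> cheapest backend: {best}")
```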

So, for a cloud HPC system we have:

A variable total compute capacity, potentially:

  • Spread across a range of possible locations
  • Using a mixture of hardware configurations
  • And potentially an assortment of technologies (VMs, Containerisation, Cloud Functions)
  • All with their own pricing model and structure

Multiple (but fairly static) clients/users, each of which has:

  • A similar base demand, plus an additional variable and cost-sensitive workload
  • Varying priorities of workload

Even in the simplest possible case, a like-for-like migration, the degrees of freedom increase by at least three (location, hardware and pricing model), assuming there is a desire to optimise cost.

Hopefully this goes some way towards illustrating that HPC on cloud, far from being a simpler problem requiring less complex schedulers, is a far more complex one, with many more degrees of freedom, requiring both cost and resource optimisation.

This isn’t to suggest that these problems are insurmountable, or that cloud adoption isn’t worth doing. The cloud provides many benefits too, and there are solutions to the above, which we will discuss in upcoming articles.