Misadventures in HPC Cloud Migration #11
Spotty Resilience in Clouds
Spot, On Demand or Reserved Instances. Well, that’s easy. Spot is the cheapest. Run everything on spot. Done.
What? You want more? Tough crowd. Fine.
Before we do that though, we need to talk about some differences in how to handle resilience in the cloud vs on premises. Generally speaking, on-prem systems are made resilient using classical engineering techniques, similar to those employed on, say, aircraft. You make everything as reliable as it possibly can be, and then anything critical is doubled up with automatic failover to a secondary. Your servers may have 2 PSUs, for example. Your database will run as an HA pair with automatic failover. Your HPC workers are split across two data centres in case you lose one of them.
In the cloud, however, you generally achieve resilience by increasing the number of points that must fail before the system as a whole fails. The resilience of any one item is not important. Reliability is achieved by having enough of them to make failure statistically unlikely, to your preferred number of nines. Off the top of my head, I think most clouds quote an uptime of four nines for a single VM. Need better than that? Run multiple VMs, potentially across multiple zones or even regions.
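The "more nines through more VMs" arithmetic is easy to sketch. Assuming independent failures and that the system only needs one VM up at a time, availability compounds like this (the four-nines per-VM figure is the one quoted above; everything else is a toy model):

```python
# Toy model: combined availability when a system needs at least ONE
# of N independent VMs to be up. The system only fails if every VM
# fails simultaneously.
def system_availability(per_vm: float, n: int) -> float:
    return 1 - (1 - per_vm) ** n

per_vm = 0.9999  # "four nines" for a single VM, as quoted above
for n in (1, 2, 3):
    print(n, f"{system_availability(per_vm, n):.12f}")
```

Two VMs already take you from four nines to eight; three gets you twelve. Real failures are of course correlated (same zone, same host, same deploy), which is why the spread across zones and regions matters more than the raw count.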
One not-so-small caveat though. That uptime applies once the VM is running. As far as I can remember, and please correct me if I’m wrong here, you have no guarantee of actually being able to start a VM. On spot capacity or otherwise. In fact, even if you’re paying for a reserved instance.
Choosing whether or not to use spot capacity is really just a function of the probability of obtaining the required capacity when needed. That probability, in turn, scales with the number of VM types your workload can run on.
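A back-of-the-envelope version of that scaling, assuming each VM type independently has spot capacity available with some probability (the 0.6 figure below is invented purely for illustration, not a real cloud statistic):

```python
# Toy model: probability of getting SOME spot capacity when each of
# n_types usable VM types independently has capacity available with
# probability p_per_type. You only strike out if every type is
# exhausted at once.
def p_capacity(p_per_type: float, n_types: int) -> float:
    return 1 - (1 - p_per_type) ** n_types

for n in (1, 3, 10):
    print(n, round(p_capacity(0.6, n), 4))
```

With one usable type you get capacity 60% of the time; with ten, essentially always. Independence is a generous assumption (capacity crunches hit related instance families together), but the direction of the effect holds: flexibility in VM type, region, or cloud is what makes spot viable.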
If the only thing your workload can run on is one model of GPU, I don’t fancy your chances of running much workload on spot. Heck, you probably won’t even get a look in on demand if you picked the “wrong” GPU. You can blame all our friends busy training their LLMs for that.
On the other hand, if you can run your workload on any CPU or GPU I’d be surprised if you ever had to pay anything other than spot prices. Especially if you can run across multiple regions. If you can use multiple clouds you’re laughing all the way to the bank.
Oh, and your optimizer from post #10 should be dealing with this anyway, remember!