Misadventures in HPC Cloud Migration #10

Cloud Optimized Orchestration

You’re probably fed up with me drip-feeding you, so let’s get down to brass tacks. How do we (HMx Labs) deal with the whole “bring your own scheduler and add a cloud orchestration mechanism” scenario?

As alluded to in the previous post, the key question to answer for any cloud orchestration component is which VM type to start (and stop). Part of this is naturally determined by the constraints of your HPC code: CPU/GPU platform, memory per core, minimum performance, SLAs and timelines, and so on. The remainder of the answer comes from something we discussed in part #7: understanding your costs.
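To make the "constraints" half of that concrete, here is a minimal sketch of filtering a VM catalogue down to the types your code can actually run on. Everything in it is an assumption for illustration: the VM names, the catalogue fields and the particular constraints are hypothetical, and a real catalogue would come from your cloud provider rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmType:
    name: str          # hypothetical VM type name, not a real SKU
    arch: str          # e.g. "x86_64" or "arm64"
    cores: int
    memory_gib: float
    has_gpu: bool


@dataclass(frozen=True)
class Constraints:
    arch: str
    min_memory_per_core_gib: float
    needs_gpu: bool


def eligible(vm: VmType, c: Constraints) -> bool:
    """True if the VM type satisfies every hard constraint of the workload."""
    return (
        vm.arch == c.arch
        and vm.memory_gib / vm.cores >= c.min_memory_per_core_gib
        and (vm.has_gpu or not c.needs_gpu)
    )


# Illustrative catalogue only.
CATALOGUE = [
    VmType("small-x86", "x86_64", 16, 32.0, False),   # 2 GiB per core
    VmType("himem-x86", "x86_64", 16, 128.0, False),  # 8 GiB per core
    VmType("himem-arm", "arm64", 32, 256.0, False),   # wrong platform
]

workload = Constraints(arch="x86_64", min_memory_per_core_gib=4.0, needs_gpu=False)
candidates = [vm.name for vm in CATALOGUE if eligible(vm, workload)]
```

Only the VM types that survive this filter are worth costing at all, which is where the second half of the answer comes in.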

Within the above constraints, the orchestration mechanism as a whole has to pick the most cost-effective VM (or VMs; there may not be enough of the first choice available). You could, I suppose, make working that out the responsibility of the orchestrator itself. But I’m old school. I follow the *nix philosophy of having each tool do one thing and do it very well. Feel free to berate (or congratulate) me in the comments.

This leads us to another component: the optimizer. Its whole reason for existence is to determine a performance-normalized cost to run your workload on every possible bit of infrastructure you may conceivably choose to run it on. The goal is simple. Achieving it, usually less so.
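One simple way to think about a performance-normalized cost is price divided by throughput, i.e. what you pay per unit of work rather than per hour. The sketch below assumes you have benchmarked your own workload on each VM type to get a tasks-per-hour figure. The names, prices and throughputs are made up, and this is not necessarily how our optimizer does it; it is just the idea in its smallest form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    vm_type: str            # hypothetical VM type name
    price_per_hour: float   # what the cloud charges for one VM-hour
    tasks_per_hour: float   # assumed to come from benchmarking your workload


def cost_per_task(offer: Offer) -> float:
    """Performance-normalized cost: money spent per unit of work done."""
    return offer.price_per_hour / offer.tasks_per_hour


def rank_offers(offers: list[Offer]) -> list[Offer]:
    """Cheapest per unit of work first."""
    return sorted(offers, key=cost_per_task)


# Illustrative numbers only.
OFFERS = [
    Offer("cheap-slow", 1.00, 100.0),
    Offer("pricey-fast", 3.00, 400.0),
    Offer("mid", 2.00, 160.0),
]

ranked = [o.vm_type for o in rank_offers(OFFERS)]
```

Note that the VM with the highest hourly price comes out on top here, because it gets through the work fast enough to be the cheapest per task. That is exactly why the hourly price alone is a poor guide.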

From there, the orchestrator’s job becomes less onerous. Start up or stop cores/VMs as dictated by the scheduler or metrics, and pick those VMs as defined by the optimizer. Neat, huh? The orchestrator just needs to be very good at turning on (and off) lots of VMs quickly. It’s told how many and told which ones.
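As a rough sketch of that division of labour: the scheduler (or your metrics) says how many cores are needed, the optimizer supplies a preference-ordered list of VM types, and the orchestrator walks the list, falling back to the next choice when the first runs out. The function name, the tuple layout and the availability figures are all hypothetical, and treating a core as a core across VM types is a simplification.

```python
import math


def plan_scale_up(cores_needed, ranked):
    """Build a start-up plan from the optimizer's ranked list.

    ranked: list of (vm_type, cores_per_vm, vms_available) tuples,
            most cost-effective first.
    Returns (plan, shortfall): plan is a list of (vm_type, count) and
    shortfall is the number of cores that could not be provisioned.
    """
    plan = []
    remaining = cores_needed
    for vm_type, cores_per_vm, available in ranked:
        if remaining <= 0:
            break
        wanted = math.ceil(remaining / cores_per_vm)
        count = min(wanted, available)  # the first choice may run out
        if count > 0:
            plan.append((vm_type, count))
            remaining -= count * cores_per_vm
    return plan, max(remaining, 0)


# Illustrative ranking only: two VMs of the first choice are available.
RANKED = [("pricey-fast", 64, 2), ("cheap-slow", 16, 10)]

plan, shortfall = plan_scale_up(200, RANKED)
```

With these numbers the plan takes both of the preferred VMs and tops up with five of the second choice. The orchestrator makes no cost decisions of its own; it only follows the order it is given.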

To some, this may seem unnecessarily complicated. It really depends on your use case. If your HPC cloud bills are only four figures, it probably seems like a lot of work. If you’re a typical systemically important bank, this approach could be saving you four or five figures every month in cloud costs. Yeah, I thought that might get your attention…

Now, where did we land on spot vs on-demand… you know the drill. Answer in the next post. Oh, and also hit that subscribe button!