Misadventures in HPC Cloud Migration #12
Dealing with Spot
It seems I may have led some of you to take me a little too literally in the opening lines of my last post. Oops. I blame my poor attempts at humour. I shall endeavour to make my hilarity more obvious in future.
So should you really run everything on spot? Yeah, of course you should! 😊 What do you mean your workload isn’t sub-second Monte Carlo simulations? What do you mean your tasks have interdependencies, and losing one task can lose hours of calculations on dependent nodes? Talk about making life hard for yourself.
If your HPC workload was engineered to run on premises (i.e. on static, always-on capacity), chances are there will be aspects of your system that rely on this, even if you don’t realise it. The older your codebase, the more likely this is. Even if you are running Monte Carlo sims. (This comes back a little to the mindset shift I mentioned in my last post on how resilience is achieved on premises vs cloud.) Naturally, this needs to be addressed first.
Not all HPC workloads can be engineered to run on pre-emptible compute capacity. Tightly coupled, long-running workloads are generally considered a poor fit for it. As was quite correctly pointed out in a reply to my last post (worth reading if you haven’t seen it already), you pay for any capacity used regardless of whether you got results from the compute or not.
Here's the thing though. It’s a balancing act. Spot capacity can be so much cheaper than anything else that it is sometimes cheaper to pay for lost compute than to pay for on-demand or reserved instances.
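To put some rough numbers on that balancing act, here is a minimal sketch in Python. The prices and the wasted fraction are made up for illustration, not real quotes from any provider, and the function names are my own. The idea is simply that if a share of the spot hours you pay for produce nothing (because the work was interrupted and lost), the cost of each useful hour goes up by that share, and spot stays the cheaper option as long as the result is still below the on-demand price.

```python
def effective_spot_cost(spot_price: float, wasted_fraction: float) -> float:
    """Cost per hour of useful compute on spot.

    wasted_fraction is the share of paid spot hours whose results were
    lost to interruptions (0.0 means nothing was lost).
    """
    if not 0.0 <= wasted_fraction < 1.0:
        raise ValueError("wasted_fraction must be in [0, 1)")
    return spot_price / (1.0 - wasted_fraction)


def break_even_waste(spot_price: float, on_demand_price: float) -> float:
    """Wasted fraction at which spot costs the same as on demand."""
    if on_demand_price <= 0.0:
        raise ValueError("on_demand_price must be positive")
    return 1.0 - spot_price / on_demand_price


if __name__ == "__main__":
    # Hypothetical hourly prices, chosen only to make the arithmetic easy.
    on_demand = 1.00
    spot = 0.30
    wasted = 0.25  # assume a quarter of paid spot hours are thrown away

    print(f"effective spot cost: {effective_spot_cost(spot, wasted):.2f}")
    print(f"break-even waste:    {break_even_waste(spot, on_demand):.0%}")
```

With these assumed figures, throwing away a quarter of your spot hours still leaves the useful hours costing about 0.40 against 1.00 on demand, and you would need to lose roughly 70% of what you pay for before spot stopped being cheaper. This ignores everything except the compute bill (deadlines, restart overhead, data transfer), so treat it as a starting point for your own numbers rather than a verdict.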
Without breaking NDAs there’s only so much I can give away, but contrary to what you may believe, the hyperscalers do actually want you using spot capacity. Not as much as they want you to use RIs or on-demand, but they really do want to sell you spot too.
Running everything on spot might not be possible for everyone, but it’s a good aspiration.
If you’re not already using spot capacity for your HPC workloads, I would encourage you to investigate. Ask your cloud solution architect for help. Or me.
We’re not done dealing with the challenges of dynamic compute capacity yet. Stay tuned, hit subscribe!