HPC 2.0?
What features would you look for in HPC 2.0?
I recently saw a post referring to AWS's Parallel Computing Service as HPC 2.0. It certainly looks like a great managed offering, and if you're looking for managed Slurm it fits the bill perfectly.
If someone asked me what I'd want from HPC 2.0, though, it falls short. That's probably more to do with the size of my wish list than anything else!
The following is a bit of a mishmash of what I'd want as an HPC engineer and as an HPC user, but here's what I wish was already available out of the box:
- Cost-based task scheduling. Sometimes there are workloads I'd like to run, but they only make business sense at a particular price point (a rather low one, usually!). I want to submit a job to the scheduler, and if at any point there is “free” capacity or the unit cost of the workload dips below a certain threshold, it should run. Otherwise it just sits in the queue waiting. (There's a rough sketch of what I mean after this list.)
- Data-aware scheduling. I see a lot of HPC workloads that are no longer CPU bound. Crazy, right? Give me a scheduler that knows where data lives, from the local worker cache to the rack, zone and region level, and can schedule my work accordingly. (See the second sketch below.)
- Of course, that means you need the data cache the scheduler talks to…
- And naturally, all of the above should be seamlessly multi-cloud, elastic and cost-optimized, all without human intervention.
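
To make the cost-based idea a bit more concrete, here's a rough Python sketch of the submit-and-wait behaviour I have in mind. Everything in it is hypothetical: `get_spot_price`, `cluster_has_free_capacity` and `submit_job` are stand-ins for whatever the scheduler and cloud provider actually expose, not real APIs.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    max_price_per_node_hour: float  # only run if the unit cost dips below this
    node_count: int


def get_spot_price() -> float:
    """Stand-in for a real pricing API: current $/node-hour (randomised here)."""
    return round(random.uniform(0.50, 2.00), 2)


def cluster_has_free_capacity(nodes_needed: int) -> bool:
    """Stand-in for a scheduler query, e.g. idle nodes reported by Slurm."""
    return random.random() < 0.1  # pretend capacity frees up occasionally


def submit_job(job: Job) -> None:
    """Stand-in for the actual submission call (sbatch, an API request, etc.)."""
    print(f"Submitting {job.name} on {job.node_count} nodes")


def cost_gated_submit(job: Job, poll_interval_s: float = 60.0) -> None:
    """Keep the job queued until it is effectively free or cheap enough to run."""
    while True:
        if cluster_has_free_capacity(job.node_count):
            print(f"{job.name}: free capacity available, running now")
            submit_job(job)
            return
        price = get_spot_price()
        if price <= job.max_price_per_node_hour:
            print(f"{job.name}: price {price} is under the "
                  f"{job.max_price_per_node_hour} threshold, running now")
            submit_job(job)
            return
        print(f"{job.name}: price {price} too high, waiting")
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    cost_gated_submit(
        Job(name="nightly-batch", max_price_per_node_hour=0.75, node_count=16),
        poll_interval_s=1.0,
    )
```

In reality this decision belongs inside the scheduler itself rather than in a polling loop on a submit host, but it captures the contract I'm after: I state a price, and the platform decides when to run.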
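
And a similarly hand-wavy sketch of the data-aware part: the locality tiers and the `Worker` structure below are made up purely for illustration, but the point is that placement is ranked by how close a candidate already is to the job's data.

```python
from typing import NamedTuple

# Lower score = data is closer = better placement (hypothetical tiers)
LOCALITY_SCORE = {"worker_cache": 0, "rack": 1, "zone": 2, "region": 3, "remote": 4}


class Worker(NamedTuple):
    name: str
    data_locality: str  # where the job's input data sits relative to this worker


def pick_worker(candidates: list[Worker]) -> Worker:
    """Prefer the worker that already has (or is nearest to) the data."""
    return min(candidates, key=lambda w: LOCALITY_SCORE[w.data_locality])


workers = [
    Worker("node-a", "region"),
    Worker("node-b", "worker_cache"),
    Worker("node-c", "rack"),
]
print(pick_worker(workers))  # -> node-b, whose local cache already holds the data
```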
There’s more but I won’t bore you with it all.
I feel like technology has evolved massively in the web and data space with amazing tools, libraries and products to build highly scalable platforms, but we have watched it all go by in the world of HPC.
What’s your wish list of features for HPC 2.0?