HPC Needs to Get Smarter

The HPC end-user experience really hasn't improved much over the years. It still forces you to become an expert in HPC rather than in your chosen domain. Time to fix that?


I read an interesting article earlier this week about a use case where a large 120-CPU machine was actually slower to run a CFD job than an 8-core laptop. While I've never personally seen that exact use case, I have run into poor performance caused by too fine a level of granularity in the job breakdown.

In financial services, the workloads are rarely (if ever) going to complete on your laptop before the heat death of the universe, but that doesn't mean performance can't be significantly hampered by breaking the workload down into more jobs than necessary, for precisely the same reasons as in the example above: the communication, split, and combination overheads.
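A toy cost model (my own illustration, not taken from the article) makes the overhead argument concrete: once the task count exceeds the available cores, each extra split buys no additional parallelism and only adds split/communicate/combine cost.

```python
def runtime(work_s: float, n_tasks: int, n_cores: int, overhead_s: float) -> float:
    """Toy model, purely illustrative: parallel compute time plus a
    fixed per-task split/communicate/combine overhead."""
    compute = work_s / min(n_tasks, n_cores)  # no speedup beyond n_cores
    return compute + n_tasks * overhead_s     # overhead grows with every task

# One hour of work on 8 cores, 0.5 s of overhead per task:
for n in (8, 64, 512, 4096):
    print(n, round(runtime(3600, n, 8, 0.5), 1))
# 8 tasks finish in ~454 s; 4096 tasks take ~2498 s doing the same work.
```

The numbers are made up, but the shape of the curve is the point: past the core count, granularity is pure overhead.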

And this is what frustrates me. Firstly, I feel the HPC world has been slow to wake up to the reality of large multi-core CPUs. There are far too many use cases where we should long ago have given up on MPI and OpenMP and simply embraced multithreaded, single-process solutions, or learnt from distributed systems architectures.

Secondly, we no longer operate on fixed-size clusters. Heck, even when we did, there was never any guarantee of how many CPUs your particular job would get anyway.

How many tasks a workload is broken up into (to allocate to individual CPUs) is generally a problem left to the HPC application or user. And it shouldn't be. The answer is a function of the available resources, not something the end user should have to decide or even think about.
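Here's a minimal sketch of what I mean, assuming a simple embarrassingly parallel workload and no particular scheduler's API. The function name and the oversubscription factor are my own inventions; the point is that the task count is derived from the resources the runtime reports, and the user never chooses it.

```python
import os

def pick_task_count(n_items: int, n_cores=None, tasks_per_core: int = 4) -> int:
    """Derive the task count from the visible resources instead of
    asking the user. tasks_per_core > 1 leaves slack for load
    imbalance; the factor 4 is an illustrative guess, not a rule."""
    cores = n_cores or os.cpu_count() or 1
    # Never create more tasks than there are items to process.
    return max(1, min(n_items, cores * tasks_per_core))

# The same workload splits differently depending on where it lands:
print(pick_task_count(1_000_000, n_cores=8))    # laptop: 32 tasks
print(pick_task_count(1_000_000, n_cores=128))  # big node: 512 tasks
print(pick_task_count(10, n_cores=128))         # tiny job: 10 tasks
```

On Linux, `os.sched_getaffinity(0)` would be a better probe than `os.cpu_count()`, since it reflects the CPUs actually allocated to the process rather than those in the machine; that distinction is exactly the kind of thing the platform, not the user, should handle.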

HPC UX is pretty sucky if we’re honest with ourselves.

Of course, we don't have the right interfaces or means of communication between the various components in an HPC platform and the application code to do this today. But that's a human and social problem, not a technological one.

Note I said HPC platform and not scheduler. I really don't want all of this to be locked into the scheduler; there's more in play here.

I spoke about some of this in my little soliloquy at HPC Club and it’s on the list for the Supercomputing Strategy Group. Isn’t it time we started to tackle these problems?