HPC Maturity and AI

HPC capability and maturity is about so much more than size

AI is pushing HPC forward, right? Supercomputers have gone from the size of a couple of tennis courts to the size of Manhattan. Sorry, I mean AI Factories 😁 So HPC can only be getting better, right? Maybe. But maybe not.

That might be the case if all you’re building is city-sized compute clusters (and even then I’m not so sure). The reality for many with an existing supercomputer or large grid cluster is that adding AI workloads is actually causing their HPC maturity to regress.

Adding AI workloads and, more critically, AI-specific hardware to organisations that already have HPC can lead to increased workload fragmentation. The greater the difference in characteristics between the AI workloads and the other HPC workloads, the worse this problem becomes. To take an extreme example, running Monte Carlo based financial risk calculations alongside AI training would result in two completely segregated compute clusters, with all the problems that brings: lower utilisation levels, increased management costs and so forth.
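To put some rough shape on the utilisation point, here is a back-of-the-envelope sketch. The hourly demand figures are entirely made up, and the "consolidated" case assumes every node could run either workload, which is exactly what AI-specific hardware tends to prevent. Treat it as an illustration of the effect, not a measurement.

```python
# Hypothetical node demand per hour over one day (invented numbers).
risk_demand = [80] * 8 + [20] * 16  # risk batch peaks overnight
ai_demand = [10] * 8 + [60] * 16    # AI training peaks during the day


def utilisation(demand, capacity):
    """Average fraction of node-hours actually used."""
    return sum(demand) / (capacity * len(demand))


# Segregated: each cluster must be sized for its own peak.
segregated_capacity = max(risk_demand) + max(ai_demand)

# Consolidated (assumes interchangeable nodes): size for the combined peak.
combined_demand = [r + a for r, a in zip(risk_demand, ai_demand)]
consolidated_capacity = max(combined_demand)

segregated_util = utilisation(combined_demand, segregated_capacity)
consolidated_util = utilisation(combined_demand, consolidated_capacity)

print(f"segregated:   {segregated_capacity} nodes, {segregated_util:.0%} utilised")
print(f"consolidated: {consolidated_capacity} nodes, {consolidated_util:.0%} utilised")
```

Because the two peaks fall at different times, the segregated setup needs more nodes to serve the same work, so average utilisation drops. Real demand curves will look nothing like these, but the shape of the problem is the same.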

To some extent this has always been the case, but with AI the conversation about measuring HPC capability has firmly become a question of “how big” or “how fast”. HPC maturity, however, depends on so much more than this.

If we were to measure HPC maturity we’d need to do so across several axes. Off the top of my head these would include workload consolidation, operations automation, observability, hardware standardisation, elasticity, cost optimisation and, of course, size/speed.
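As a rough idea of what scoring across those axes could look like, here is a minimal sketch. The axis names come from the list above; the 0 to 5 scale, the equal weighting and the function names are my assumptions for illustration, not a description of any finished tool.

```python
# Axes taken from the list above. The 0-5 scale and equal weights are assumptions.
AXES = (
    "workload consolidation",
    "operations automation",
    "observability",
    "hardware standardisation",
    "elasticity",
    "cost optimisation",
    "size/speed",
)
MAX_SCORE = 5


def maturity_profile(scores):
    """Validate raw 0..MAX_SCORE scores and normalise each axis to 0..1."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for axis in AXES:
        if not 0 <= scores[axis] <= MAX_SCORE:
            raise ValueError(f"{axis}: score must be between 0 and {MAX_SCORE}")
    return {axis: scores[axis] / MAX_SCORE for axis in AXES}


def overall(profile):
    """Unweighted mean across all axes (equal weighting is an assumption)."""
    return sum(profile.values()) / len(profile)


def weakest_axes(profile, n=2):
    """The n lowest-scoring axes, i.e. where to look first."""
    return sorted(profile, key=profile.get)[:n]


# Hypothetical organisation: very large, but fragmented after adding AI hardware.
example = maturity_profile({
    "workload consolidation": 1,
    "operations automation": 3,
    "observability": 2,
    "hardware standardisation": 1,
    "elasticity": 2,
    "cost optimisation": 2,
    "size/speed": 5,
})
print(f"overall: {overall(example):.2f}")
print("weakest:", weakest_axes(example))
```

The point of keeping a per-axis profile rather than a single number is that a top score on size/speed cannot hide low scores on consolidation or standardisation.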

Whilst AI may be pushing boundaries in some of these, it is also introducing changes that have to be incorporated alongside (not in place of) existing supercomputer tech stacks. This, almost by definition, means increased fragmentation. In some cases, such as the example above, we don’t even have a common scheduler solution that would bridge both worlds (yet).

I thought it might be a good idea to help people evaluate their HPC maturity level, so last night I did the easy bit and cooked up an app to measure and display HPC maturity in an organisation. Now for the hard bit: figuring out the right questions to include, ones that people can answer themselves regardless of their level of HPC expertise.

Suggestions gladly accepted!