Dreams and Wishes for HPC in 2025
A contrarian look at what I hope will be forthcoming for HPC in 2025 rather than any real predictions.
I was going to write a piece about what to expect in 2025 in the world of HPC but then I realised my crystal ball is no better than yours. In fact, looking at my investment history, I’d say it’s a fair bet that it’s worse 😆
I could regale you with fairy tales of what I think will happen in 2025 but all that does is give you ammunition to declare “You were wrong, I informed you thusly” later this year. Plus, as the finance types among you know all too well, “Markets can remain irrational longer than you can remain solvent” and that applies equally well to time-boxed predictions. I’m looking at you, AI hype.
So instead, I’m going to share some of my wishes for higher performance and not so high performance compute. Like most wishes they probably won’t come true, but if they did, I think we’d be in a better place. Maybe, just maybe, it might even inspire one or two of you to help make them reality.
Break out of the Niche
Several times in 2024 I asked the question, “What is HPC?”. I attempted to answer it a couple of times too. I wasn’t just being obstinate. Rather, I was trying to make a point.
At its inception, HPC was special. It was something needed by a very small number of people and institutions around the world. It made sense that it was something niche. After all, how many people really needed the ability to string together commoditised hardware to obtain teraflops of compute? A few scientists, a couple of engineers, some banks and that was your lot.
Let me ask another question though. What do HPC engineers do? At its most basic level they take compute that no longer fits onto a single machine, optimize the hell out of it and then run it across thousands of computers. In case you hadn’t noticed, half the world seems to need to do that right now.
My first wish for 2025 is for HPC folk to realise this. To step up and out of their niche and use that ability to help solve the same problem in other domains, domains that see HPC as a dinosaur and don’t have the first inkling that we have the solutions to their problems.
The whole world is slowly realising that even if they could buy all the compute they need, we simply don’t have the electricity to run all of it! We are going to have to do more with less. Guess who’s been doing that for decades already?
Power constrained architectures are nothing new in the mobile space but we’re now having to deal with that reality in the data centre. Arm and AMD’s relentless improvements in CPU power efficiency don’t mean diddly squat when NVIDIA’s GPUs are eating every last electron they can find.
HPC doesn’t have to be a niche offering in the clutches of a select few. Let’s think bigger.
Stop Reinventing the Wheel
Following on from the above, because non-HPC industries are either unaware of, alienated by or just repulsed by HPC, a plethora of solutions in the same space are being reinvented.
Don’t get me wrong. Choice is a good thing. I’d love to have multiple fantastic solutions to pick from. But I don’t.
Instead, I have a smorgasbord of poorly documented products that do half of what I need. In a particularly user-unfriendly fashion. Honestly, I’d give that up for a handful of polished products with great documentation and community support.
I know I’m being rather critical here but if you measure most HPC products against almost any recent popular software offering, whether that’s developer or user focused, the level of polish in anything HPC is a far cry from what we generally expect from modern software.
Go From “It Kind of Works” to “It Just Works”
Something I’ve said to a few people in conversations during 2024 is that HPC needs its “Apple” moment. It needs to go from having functional products that require a PhD to operate to having products that are intuitive, with sensible defaults that are easy to adopt.
If the goal is to achieve wider adoption of HPC solutions to today’s problems, those solutions are going to have to be a lot more usable.
Explode the Monolith
A good scheduler may well be the cornerstone of HPC but it’s about time we realised it’s not the only foundation we need. Heck, let’s be honest, shall we? Most HPC isn’t even compute constrained anymore. That’s a dirty little secret we tend to keep to ourselves, and whisper in hushed voices when sipping coffee around the liquid cooled supercomputer.
Isn’t it time we acknowledged it? Isn’t it time we admitted that a data plane is just as important? A node, rack, zone and region aware data plane. And a hardware control plane. That works for cloud, on prem or colo.
But critically I don’t want my data plane to be part of my scheduler. Or my hardware control. No, I want an open standard to allow them to communicate.
As anyone who has attempted to swap schedulers in an HPC system will tell you, the API changes are trivial. The problems come because the choice of scheduler inevitably leaves its mark on the architecture of the rest of the system. The mark of the beast. Can we be done with that already?
I want my scheduler to schedule and be bloody good at it but do no more. I want it to talk to my data plane to figure out where data is to make those scheduling decisions. I want it to talk to my hardware control plane to spin up/down compute as needed.
If we ever have a hope of using ML/AI in HPC infrastructure we need to start by exposing the data needed to train it (more on this later).
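To make that separation a bit more concrete, here is a rough sketch in Python. To be clear, no such open standard exists in this form: every class and method name below (DataPlane, HardwareControlPlane, locate, running_nodes, spin_up) is invented purely for illustration, and the in-memory classes are stand-ins for real systems. The only point is that the scheduler knows nothing beyond two narrow interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


class DataPlane(Protocol):
    """Hypothetical interface: answers 'where does this dataset live?'."""

    def locate(self, dataset: str) -> set[str]: ...


class HardwareControlPlane(Protocol):
    """Hypothetical interface: reports and changes which nodes are running."""

    def running_nodes(self) -> set[str]: ...

    def spin_up(self, node: str) -> None: ...


@dataclass(frozen=True)
class Task:
    name: str
    dataset: str


class Scheduler:
    """Schedules and does no more; all other knowledge comes via the interfaces."""

    def __init__(self, data: DataPlane, hardware: HardwareControlPlane) -> None:
        self.data = data
        self.hardware = hardware

    def place(self, task: Task) -> str:
        holders = self.data.locate(task.dataset)
        if not holders:
            raise LookupError(f"no node holds {task.dataset!r}")
        # Prefer a node that already holds the data and is already running.
        warm = sorted(holders & self.hardware.running_nodes())
        if warm:
            return warm[0]
        # Otherwise ask the hardware control plane to start a node next to the data.
        node = sorted(holders)[0]
        self.hardware.spin_up(node)
        return node


class InMemoryDataPlane:
    """Toy stand-in for a real data plane (illustration only)."""

    def __init__(self, placement: dict[str, set[str]]) -> None:
        self.placement = placement

    def locate(self, dataset: str) -> set[str]:
        return set(self.placement.get(dataset, set()))


class InMemoryHardware:
    """Toy stand-in for a real hardware control plane (illustration only)."""

    def __init__(self, running: set[str]) -> None:
        self.running = set(running)

    def running_nodes(self) -> set[str]:
        return set(self.running)

    def spin_up(self, node: str) -> None:
        self.running.add(node)
```

Swapping the scheduler in a world like this would mean replacing one class. The data plane and hardware control plane would not need to know or care.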
Embrace Diversity
To those proclaiming the death of x86, I’d like to remind you that while we may already be calling the death of Intel, the x86_64 architecture was produced by more than just Intel. Not only that, but AMD’s latest (aptly named) EPYC CPUs are more power efficient (when performance adjusted) than their Arm based counterparts.
That’s not a reason to sit back and assume x86_64 is the only game in town. Sure, HPC clusters used to be comprised of dedicated homogeneous hardware. Those days are gone. Get used to it. If you want cost-effective, large-scale compute in future you’re looking at a heterogeneous compute environment that comprises multiple CPU architectures including Arm, x86_64 and RISC-V as well as multiple accelerators including GPGPUs, FPGAs and ASICs.
That leaves us with another problem though. You can bet your compute allocation that scientists researching new cancer treatments (or the engineers trying to model new spacecraft or traders managing their risk) do not give a FLOP which flavour of silicon was used.
The HPC infrastructure needs to remove that complexity. We already have the technology needed to automatically select the most appropriate hardware based on any number of criteria including cost, performance and green credentials. It’s about time to spin that together into something more accessible by the masses.
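As a toy illustration of what "automatically select the most appropriate hardware" could mean, here is a minimal weighted-scoring sketch in Python. This is an assumption about one possible approach, not a description of any existing product: the HardwareOption type, the pick_hardware function and the weights are all made up, and the per-job estimates would have to come from somewhere real (benchmarks, billing data, grid carbon intensity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareOption:
    """Hypothetical per-job estimates for one kind of hardware."""
    name: str
    cost: float      # estimated cost of running the job
    runtime: float   # estimated time to finish the job
    carbon: float    # estimated emissions from running the job


def pick_hardware(options, w_cost=1.0, w_runtime=1.0, w_carbon=1.0):
    """Return the option with the lowest weighted score (lower is better)."""
    if not options:
        raise ValueError("no hardware options supplied")

    def scaled(value, values):
        # Scale each criterion by its worst (largest) value so units don't matter.
        worst = max(values)
        return value / worst if worst > 0 else 0.0

    costs = [o.cost for o in options]
    runtimes = [o.runtime for o in options]
    carbons = [o.carbon for o in options]

    def score(option):
        return (
            w_cost * scaled(option.cost, costs)
            + w_runtime * scaled(option.runtime, runtimes)
            + w_carbon * scaled(option.carbon, carbons)
        )

    return min(options, key=score)
```

The person submitting the job only has to say what they care about (cheap, fast, green, or some mix) by way of the weights. Which flavour of silicon falls out of that is the infrastructure’s problem, not theirs.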
HPC needs to embrace the DEI agenda, and I don’t mean the carbon based but rather the silicon based life form version.
HPC & AI
HPC and AI have this really odd, incestuous love-hate relationship. HPC was necessary for the initial training of modern LLMs. Speak to most AI startups though and they probably won’t know what HPC even stands for.
While LLMs might be attempting to replace traditional HPC calculations in some domains, what I’ve always been interested in using AI (well more strictly just ML) for is to improve the scheduler. I’ve wanted to do that since about the time Google released TensorFlow. Writing a scheduler to take advantage of ML isn’t that hard. What is hard is getting the HPC application that uses the scheduler to expose enough information with each task to be able to train anything.
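For a sense of what "enough information with each task" might look like, here is a small Python sketch. The field names are my own guesses at useful signals rather than anything a particular scheduler or application exposes today: the application declares a description up front, the infrastructure records an outcome afterwards, and the pair becomes one training example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskDescription:
    """What the application would declare when submitting a task (hypothetical fields)."""
    task_type: str
    input_bytes: int
    requested_cores: int
    requested_memory_bytes: int


@dataclass(frozen=True)
class TaskOutcome:
    """What the infrastructure would observe once the task has run (hypothetical fields)."""
    runtime_seconds: float
    peak_memory_bytes: int


def training_example(description: TaskDescription, outcome: TaskOutcome):
    """Pair the declared features with the observed runtime as a (features, label) tuple."""
    features = asdict(description)
    label = outcome.runtime_seconds
    return features, label
```

None of this is hard to write. The hard part is persuading every application to fill in the description honestly and consistently for every task.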
This is slowly changing though and if some of the ideas above come to fruition then we stand half a chance of having AI optimized HPC.
If Terminator or HAL 9000 style AI scenarios already scare you, wait till we can have AI train new models. At that point the current LLM doesn’t need to be able to take over the world, it just needs to be able to train up a new Skynet.