Will HPC Survive AI?

I’ve debated writing this for a long time. The idea has sat in my list of topics to write about for at least a year, if not longer. Then last week Addison Snell’s article in HPCwire brought the subject to the fore, and that has given me a little of the impetus required to put finger to keyboard. Whether I am brave enough to hit the publish button remains to be seen. If you’re reading this, then I guess I did.
I’m told by social media gurus that you’ll never please everyone, so just write what you want. I hope they’re right. I’ve timed this to go live during ISC, expecting that the classical HPC crowd will miss it 😆 but if I’m wrong, I may have to be in stealth mode for the last day, since I will also be at ISC then!
While Addison’s (very good) article is grounded in solid data, mine is based on little more than anecdotes and conversations I’ve had over the last ten to fifteen years of my life working in what I will refer to as HPC, choosing, as I do, to use that term as a broad umbrella rather than in its classical sense of supercomputing with inter-process communication using MPI.
As such, I will start with the very large caveat that this is simply my opinion, a product of my own experiences. I may well be wrong, and I may be alone in this opinion, but I suspect I am not.
Before I delve into AI though, I want to talk a little about something that Addison briefly touched on, HPC in financial services.
The article (incorrectly) states that HPC in finance grew due to the arms race in high frequency trading. In the era before the 2008 financial crash, the growth was primarily driven by the increasing popularity of complex derivative products. These were expensive to value and run risk metrics on; some product types, such as FX TARNs, required hours of compute time for a single present value calculation. In the post-2008 world, demand was driven by a combination of increased regulatory and reporting requirements and the adoption of full revaluation value-at-risk models, which required running years of risk daily in order to reduce a bank’s capital requirements.
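To give a rough feel for why full revaluation value-at-risk is such a compute sink, here is a toy sketch. It is entirely my own illustration: the one-line pricer, the portfolio size, and the scenario count are all made up, not taken from any real risk system. The point is the multiplication: every scenario reprices every trade from scratch.

```python
# Toy illustration (mine, not from the article) of full-revaluation,
# historical-simulation VaR. Cost is roughly:
#   scenarios x trades x cost-of-one-valuation
# A real exotic (e.g. a TARN) replaces the one-line pricer below with its
# own Monte Carlo simulation, which is where the hours of compute go.
import numpy as np

rng = np.random.default_rng(0)

def price_trade(spot):
    """Stand-in pricer; a real exotic would run its own Monte Carlo here."""
    return max(spot - 100.0, 0.0)

def full_revaluation_var(spots_base, scenario_shocks, confidence=0.99):
    """Reprice the whole book under every historical shock, then take the tail."""
    base = sum(price_trade(s) for s in spots_base)
    pnl = [sum(price_trade(s * shock) for s in spots_base) - base
           for shock in scenario_shocks]
    return -np.percentile(pnl, (1.0 - confidence) * 100.0)

# Toy book of 10,000 trades under 500 scenarios: already ~5 million pricings.
book = rng.uniform(80.0, 120.0, size=10_000)
shocks = rng.normal(1.0, 0.02, size=500)
print(full_revaluation_var(book, shocks))
```

Scale the toy book up to a real bank’s trade count, years of historical scenarios and Monte Carlo pricers, and you can see why the nightly risk run needed a small supercomputer’s worth of commodity cores.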
The distinction is important. In the pre-2008 era, budgets were lax; post-2008, budgets were constrained, which meant the way growth in HPC happened in these two eras was markedly different. AI today is very much in the equivalent of the pre-2008 era. We may yet see it transition to the post-2008 model if (when) the AI bubble pops.
Another important distinction is that pre-2008 was before mass adoption of the cloud. It was an era in which large financial services firms were still amongst the largest buyers of compute and were able to dictate the direction of the market (to some extent).
During this time, money was lavished upon new quant libraries to value derivatives, risk systems (complete with their own scheduling) to run those at scale, and of course hardware to run it all on.
You know what didn’t happen? No one went out and bought Cray hardware. No one used Slurm or PBS or HTCondor.
Within financial services those were seen as old hat. Legacy technology. For the engineers building the risk systems (and their CVs), that stack wasn’t going to get a look-in. Even today we still get called in to modernise such systems.
There was enough disdain for supercomputing that companies such as Platform Computing and DataSynapse were able to establish themselves as a two-horse race for commercial replacements for in-house scheduling.
I can already hear you shouting, though: that’s not HPC! There’s no MPI. It’s running on commodity Dell hardware. There’s no high-speed networking. And sure, you’d be right. But during this time I saw compute clusters that would have comfortably been in the top 10 of the Top500. You sure that’s not a supercomputer? Want to tell me how many of your jobs running on El Capitan today use MPI, and across how many nodes?
The animosity and mistrust seemed to stem from both sides.
Sure, financial risk used supercomputing/HPC. Did you hear them brag about that? No. In fact, the traders and risk managers reliant on the underlying technology probably couldn’t even spell HPC. You might as well have told them you were using a database of some form. It was about as exciting.
Some of the developers of the risk systems may have been familiar with HPC and grid computing concepts, but even they were busy building distributed systems and service-based architectures. Modern software engineering.
The bank’s central grid teams (where such a function even existed) were probably the only ones who even remotely associated themselves with HPC as a technology stack.
And this is where we find ourselves today with AI. Within the HPC and supercomputing community we make a great deal of noise about how HPC is responsible for the growth of AI, how it’s the next killer application (as per Addison), all the while the AI community looks down its nose at HPC and builds out alternatives like Run:ai, which was acquired by NVIDIA for a cool $700 million. Not to mention other companies, such as MemVerge, vying for the same space.
The hardware and silicon manufacturers have long since pivoted to those with larger wallets and are producing hardware that directly targets AI training and inference rather than the traditional scientific workloads of classical HPC.
Much like the banks pre-2008, AI companies and hyperscalers are simply throwing money and hardware at the problem. We are starting to see glimpses of this in low utilisation rates too.
Even if the AI bubble were to pop tomorrow, I suspect there is a large enough installed base of GPUs, and enough entrenchment in the current operational and scheduling paradigms, that we won’t see a shift back towards classical HPC tooling.
I see a few sparks of hope here and there. Some of the biggest shifts in software engineering, such as containerisation, seem to have made their way into HPC. But none of that will change the image that HPC already has.
I’m in the midst of launching a new initiative around HPC/supercomputing, and when I shared the name with a few people, one of the consistent comments was: if you want to make any real money, associate it with AI infrastructure rather than supercomputing.
In answer to Addison’s question (“HPC has found its killer application, now what?”), I think it’s too late. I think HPC already missed the boat. Mostly.