Heterogeneous or Homogeneous HPC?

A look at recent developments in CPUs and GPUs, and what they mean for the future of HPC


For the last twenty years, if you had stuck your head in the sand and decided to just run everything on Intel x86, you’d have been fine. I don’t think that’s going to be true for the next twenty.

Let’s talk silicon. With everything that’s happened in this space just this week, how can we not?

This week alone has brought a fresh round of announcements.

Over the course of this year we’ve seen Google launch its own ARM-powered silicon in the form of Axion, and NVIDIA bring us the GB200 superchip.

In the consumer space we’ve obviously had Apple Silicon seriously disrupt x86’s dominance in end-user computing, in much the same way ARM chips previously did for mobile devices.

More and more we are looking at an increasingly heterogeneous future for our compute needs. That means not only a mixture of CPU and GPU, but also a mixture of CPU architectures in the wild.

To my mind there are a few macro techno-economic factors driving this change, not least of which is the insane compute demand of the current AI hype cycle. Anyone following my writing will already know I’m not a believer: when (not if) that hype train falls into the trough of disillusionment, much of that demand will vaporise. There will remain, however, two other important factors that will continue to drive diversity in the CPU space.

The first of these is the CHIPS Act and increasing global geopolitical tension. Since the USA relaxed export controls around encryption technology in 1999, we’ve essentially been in an era where the world’s technology has been provided mostly by one country and, in reality, the CPUs by one company: Intel. The CHIPS Act has put an end to that. Not only are countries looking at investing in sovereign AI capabilities, but that often comes with sovereign silicon capability to power that AI and, by extension, other compute. If we continue down the path of increased export controls, we are likely to see increased development of alternative silicon. And even if other countries and companies are not successful in developing competitive substitutes, the export controls themselves will transform the way we see compute. It will become even more heterogeneous across regions, and global corporations will have to distribute compute based on a complex algorithm with data residency, legality and performance requirements as inputs (even more than we already do today).
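To make that concrete, here’s a minimal sketch of what such a placement decision might look like. Everything here is an assumption for illustration: the region attributes, the allow-list standing in for real legal review, and the toy latency-plus-cost objective.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    country: str
    export_restricted: bool  # would come from export-control lists
    latency_ms: float        # would come from live telemetry
    cost_per_hour: float

def place_workload(regions: list[Region], allowed_countries: set[str]) -> Region:
    """Treat residency and export rules as hard filters, then optimise
    performance and cost among the surviving regions."""
    legal = [r for r in regions
             if r.country in allowed_countries and not r.export_restricted]
    if not legal:
        raise RuntimeError("No region satisfies the residency constraints")
    # Toy objective: trade latency against cost with an arbitrary weight.
    return min(legal, key=lambda r: r.latency_ms + 100 * r.cost_per_hour)

regions = [
    Region("eu-west", "IE", False, 12.0, 3.2),
    Region("us-east", "US", False, 85.0, 2.9),
    Region("ap-east", "HK", True, 140.0, 2.5),
]
print(place_workload(regions, allowed_countries={"IE", "DE", "US"}).name)
```

The shape of the problem is the point: legality is a hard filter, performance and cost are a soft objective, and the answer differs by workload and by jurisdiction.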

The second is the slow transition of compute to commodity status, and the expected behaviour of the new cloud utility companies to counter this and retain their profit margins for as long as possible. As we’ve seen in other industries in the past, cloud utility providers will increasingly attempt to differentiate their offerings to entice customers and increase switching costs. Custom silicon is a natural extension of this, and as such the large proportion of AWS’ new capacity being Graviton, and their investment in Tenstorrent, should come as no surprise.

So, what does all this mean for HPC? I think there are two possible paths, and we may follow them both.

Traditionally, HPC workloads have been run on large homogeneous (on-prem) clusters. The typical approach has been to purchase a large quantity of silicon, all the same shape and flavour, and deploy it with a scheduler capable of distributing workload across it all. More recently, though, we’ve seen green shoots of heterogeneous HPC (including the use of cloud) where multiple types of compute may be used.
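Today’s schedulers can already express some of this. Slurm, for instance, lets sites tag nodes with features and lets jobs request them via --constraint. A sketch of routing jobs to matching hardware follows; the feature names and workload categories are assumptions, since features are site-defined.

```python
import subprocess

# Map workload types to Slurm node features. Feature names such as
# "x86_64", "aarch64" and "gpu" are site-defined; these are assumptions.
CONSTRAINTS = {
    "legacy_x86_binary": "x86_64",
    "portable_cpu_job": "x86_64|aarch64",  # Slurm OR syntax: either works
    "training_job": "gpu",
}

def submit(script: str, workload_type: str) -> None:
    """Submit a batch script, constrained to nodes the workload can run on."""
    constraint = CONSTRAINTS[workload_type]
    subprocess.run(
        ["sbatch", f"--constraint={constraint}", script],
        check=True,
    )

submit("run_model.sh", "portable_cpu_job")  # hypothetical job script
```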

We will either see this trend drastically accelerate and embrace highly heterogeneous HPC, or we will see a re-entrenchment back to owning and controlling a homogeneous set of hardware.

The path we follow here will depend greatly on how the scheduler industry steps up (or not). Quite frankly, your average scientist, engineer or quant doesn’t care what kind of silicon their workload ran on. They just want it done as fast as possible. If we can abstract away that complexity and produce portable workloads that run across multiple architectures, then we may well see the former path.
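What might that abstraction look like? A toy sketch, assuming workloads ship as multi-arch container images and the scheduler knows which architectures each image was built for. The image registry and free-node counts here are invented for illustration.

```python
# A toy dispatcher: the user describes the work; the scheduler, not the
# user, decides which silicon runs it. All names are illustrative.

AVAILABLE = {  # architecture -> free node count (would come from telemetry)
    "x86_64": 2,
    "aarch64": 48,
}

IMAGES = {  # workload -> architectures its container image was built for
    "cfd_solver": {"x86_64"},             # legacy single-arch codebase
    "monte_carlo": {"x86_64", "aarch64"}, # multi-arch, fully portable
}

def pick_architecture(workload: str) -> str:
    """Prefer the supported architecture with the most free capacity."""
    candidates = IMAGES[workload] & AVAILABLE.keys()
    if not candidates:
        raise RuntimeError(f"No runnable architecture for {workload}")
    return max(candidates, key=lambda arch: AVAILABLE[arch])

print(pick_architecture("monte_carlo"))  # -> aarch64 (most free nodes)
print(pick_architecture("cfd_solver"))   # -> x86_64 (only option)
```

Note the legacy single-arch codebase in this sketch: that is exactly the anchor that drags a site back towards homogeneity.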

On the other hand, if we remain stuck with primitive schedulers that have limited visibility into their workloads’ characteristics, and with old codebases capable of running on only a single architecture, then you can guess where we will end up.