Improving Utilisation in HPC and AI – Part 3

Part 3 in the mini-series on improving utilisation rates in HPC and AI. The first (but not the last) on dealing with the impact of fragmentation on utilisation rates.

Fragmentation. The bane of many an HPC/AI infrastructure engineer’s life. You start off with an empty supercomputer, and the first run aligns perfectly. Every core is used, every GPU humming along. It only goes downhill from there.

Jobs complete, leaving gaps that are only partially filled by other jobs in the queue, because those jobs aren’t quite the right size. Left to their own devices, things only get worse over time. It’s like playing Tetris, except that instead of blocks disappearing from the bottom when you fill up a line, they vanish from anywhere at random times. On the plus side, you can place them wherever you like; they don’t have to drop in from the top.

There are of course solutions to this, and we will look at a few over the next few articles, but the first is easy. Cheat.

Like many HPC optimisation techniques, this one may seem counterintuitive at first, but the easiest way to minimise the problem is to rig the game. Make all blocks the same size. Any time one disappears, the next one in the queue is always the same size, so it fits the gap exactly.
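To make the rigged game concrete, here is a minimal toy simulation. It is a sketch under loud assumptions, not a model of any real scheduler: the cluster is a single row of nodes, every job needs a contiguous run of nodes, the queue is endless and strictly first-in-first-out (no backfill), and placement is first fit. The names `simulate` and `first_fit`, the job sizes and the run times are all made up for illustration.

```python
import random


def first_fit(owner, size):
    """Return the start index of the first free contiguous run of `size` nodes, or None."""
    run = 0
    for i, job in enumerate(owner):
        run = run + 1 if job is None else 0
        if run == size:
            return i - size + 1
    return None


def simulate(total_nodes, sizes, steps=200, seed=0):
    """Mean utilisation of a toy cluster fed by an endless FIFO queue.

    Assumptions (illustrative only): contiguous placement, strict FIFO,
    job sizes drawn uniformly from `sizes`, run times of 1 to 10 steps.
    """
    rng = random.Random(seed)
    owner = [None] * total_nodes  # owner[i] = id of the job on node i, None if free
    end_time = {}                 # job id -> step at which the job finishes
    next_id = 0
    head = rng.choice(sizes)      # size of the job at the head of the queue
    busy_total = 0

    for t in range(steps):
        # Release the nodes of every job that has finished by step t.
        for i, job in enumerate(owner):
            if job is not None and end_time[job] <= t:
                owner[i] = None

        # Keep placing the head of the queue until it no longer fits.
        while True:
            start = first_fit(owner, head)
            if start is None:
                break
            for i in range(start, start + head):
                owner[i] = next_id
            end_time[next_id] = t + rng.randint(1, 10)
            next_id += 1
            head = rng.choice(sizes)

        busy_total += sum(job is not None for job in owner)

    return busy_total / (steps * total_nodes)


uniform = simulate(64, sizes=[8])               # every job is one 8-node block
mixed = simulate(64, sizes=[3, 5, 8, 13, 21])   # arbitrary, mismatched sizes
print(f"uniform blocks: {uniform:.2%}  mixed sizes: {mixed:.2%}")
```

With a single block size that divides the cluster evenly, every gap that opens up is exactly one block wide (or a whole number of blocks), so the head of the queue always fits and the toy cluster stays full. With mixed sizes the head of the queue regularly fails to fit the gaps on offer, and nodes sit idle. Real schedulers backfill and are far smarter than this, so treat the numbers as an illustration of the mechanism rather than a prediction.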

This does of course mean that your applications need to change. The application owner will probably tell you it’s not possible. That’s only sometimes true. It may well be correct that splitting up a single application is slightly suboptimal for that application, but the overall usage of the compute will be much higher. Worst case scenario, the jobs can be multiples of a base size. That makes life a little harder but is still better than a collection of randomly sized jobs.
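As a rough sketch of the "multiples of a base size" fallback, the helper below rounds a request up to whole blocks and reports how many nodes are spent on padding. The function name `standardise` and the example figures are hypothetical; how a real scheduler would enforce this (submission filters, partition limits, policy) is outside the scope of the sketch.

```python
def standardise(nodes_requested, base_size):
    """Round a request up to whole blocks of `base_size` nodes.

    Returns (number_of_blocks, padding_nodes), where padding_nodes is the
    number of extra nodes allocated purely to reach a whole block.
    """
    if nodes_requested <= 0 or base_size <= 0:
        raise ValueError("nodes_requested and base_size must be positive")
    blocks = (nodes_requested + base_size - 1) // base_size  # integer ceiling
    padding = blocks * base_size - nodes_requested
    return blocks, padding


# A 20-node request on a cluster standardised on 8-node blocks.
blocks, padding = standardise(20, 8)
print(blocks, padding)  # 3 4
```

The padding is the price of the cheat: a few nodes are deliberately over-allocated to a job so that whatever it leaves behind is always a shape the next job can use.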

Of course, this technique can be extended to multiple standardised sizes and then potentially combined with previously mentioned techniques to share resources or segregate them.

The hardest part of this is cajoling the application owners to modify their code. To make things worse, the person who has to do the cajoling usually has neither the authority nor the budget to drive that change. Even so, there are levers you can pull to nudge things in the right direction.

P.S. After seeing me write this article, my 8-year-old daughter confidently told me that the game is in fact called Block Blast and Tetris must be an old name 😆. I guess I’m old.
