Improving Utilisation in HPC & AI – Part 1
The first in a mini-series exploring tips on how to improve the utilisation rates of your HPC and AI compute resources
Improving utilisation in HPC can seem tricky, but sometimes the simplest solutions are both highly effective and easy to implement.
I’ll keep this as generic as I can, but I’ve used the techniques I’ll cover across multiple schedulers, CPUs and GPUs, and on both compute-bound and memory-bound processes. Exactly how any of these techniques is implemented will vary from one system to another and may depend somewhat on the capabilities of your chosen scheduler.
First up, and a seemingly popular topic at the moment: what you thought was a compute-bound process is actually memory-bound (shocker, huh! 😆 you’ll have to forgive the HPC nerds among us for rolling their eyes 🙄 at this point). The first thing you should try: oversubscription.
Somewhat counterintuitively, I have had success with this approach for both CPU- (and RAM-) bound processes and GPU- (and VRAM-) bound applications. One of the things we tend to forget is that the operating system itself is remarkably good at scheduling competing workloads on the same host. Leverage this. Let your HPC scheduler oversubscribe each host. A combination of the natural variation in job times and the operating system’s ability to prioritise workloads means you can reclaim compute time that would otherwise sit “idle” while waiting on memory accesses. To be clear, I am referring here to workloads that are constrained by the speed at which memory can be accessed, not by the total amount of memory available.
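If you want to see this effect on a single host before touching any scheduler settings, here is a minimal Python sketch. It is illustrative only: the payload function is a toy stand-in I’ve made up (swap in something that looks like your real, memory-access-heavy work), and all it does is run the same batch of tasks at 1x, 2x and 4x workers per core and report throughput, leaving the interleaving entirely to the operating system.

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

def memory_heavy_task(_):
    # Toy stand-in for a memory-access-bound job; replace with your real payload.
    data = list(range(2_000_000))
    total = 0
    for i in range(0, len(data), 64):  # strided reads
        total += data[i]
    return total

def jobs_per_second(n_workers, n_jobs=64):
    # Run a fixed batch of jobs with n_workers processes and measure throughput.
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(memory_heavy_task, range(n_jobs)))
    return n_jobs / (time.perf_counter() - start)

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    for factor in (1, 2, 4):  # 1x = no oversubscription; 2x/4x = oversubscribed
        workers = cores * factor
        print(f"{factor}x ({workers} workers): {jobs_per_second(workers):.1f} jobs/s")
```

On a real cluster you would apply the same idea through the scheduler (advertising more job slots per host than there are physical cores or GPUs), but the measurement habit is the same: change one ratio, measure, repeat.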
The exact ratio by which you oversubscribe the compute resource (CPU or GPU) is generally a function of the workload, and if you are running multiple different workloads this can become harder to manage effectively.
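One way to keep that manageable is to make the ratio explicit per workload rather than a single global setting. The sketch below is purely illustrative: the workload names and ratios are assumptions, not recommendations, and the resulting slot counts would feed into however your scheduler expresses per-host capacity.

```python
# Illustrative only: per-workload oversubscription ratios (assumed values)
# and how they translate into concurrent job "slots" per host.
OVERSUBSCRIPTION = {
    "preprocessing": 2.0,  # stalls on memory a lot, tolerates 2x
    "solver": 1.0,         # genuinely compute-bound, don't oversubscribe
    "inference": 1.5,      # GPU jobs that leave compute/VRAM headroom
}

def slots_per_host(workload: str, physical_units: int) -> int:
    """How many concurrent jobs of this workload to allow on one host."""
    ratio = OVERSUBSCRIPTION.get(workload, 1.0)  # default: no oversubscription
    return max(1, int(physical_units * ratio))

print(slots_per_host("preprocessing", physical_units=64))  # -> 128
print(slots_per_host("inference", physical_units=8))       # -> 12
```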
This is also not a get-out-of-jail-free card. You will need to balance time-to-result against throughput as you tweak the oversubscription, and changes in job patterns over time need to be monitored to ensure consistent performance. Large short-term variations in the job mix can also produce significantly worse outcomes.
Used properly, however (lots of instrumentation and observability of your environment is key!), this is a valuable technique, and I’ve seen significant improvements in performance from adopting it in the past.
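As for what to instrument, I’d start with just two numbers per workload: throughput and time-to-result. A minimal sketch, assuming you can pull submit/start/end times out of your scheduler’s accounting (the JobRecord fields here are my assumption; map them to whatever your scheduler actually exposes):

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class JobRecord:
    submit_time: float  # epoch seconds (assumed fields; map from your accounting data)
    start_time: float
    end_time: float

def summarise(jobs):
    """Throughput vs time-to-result: the trade-off to watch as you tune oversubscription."""
    window_s = max(j.end_time for j in jobs) - min(j.submit_time for j in jobs)
    return {
        "throughput_jobs_per_hour": len(jobs) / (window_s / 3600),
        "median_time_to_result_s": median(j.end_time - j.submit_time for j in jobs),
        "median_run_time_s": median(j.end_time - j.start_time for j in jobs),
    }

# Made-up example: two jobs submitted together and run concurrently (oversubscribed).
print(summarise([JobRecord(0, 5, 70), JobRecord(0, 5, 95)]))
```

Track those over time; if median time-to-result creeps up faster than throughput improves, the ratio is probably too aggressive for the current job mix.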
