ISC25, PPP & Observability in HPC
I’m going to keep beating this drum till we fix it… We need observability in HPC and it’s about time we stopped pretending otherwise.

The other workshop I attended at ISC25 was on the Pain of Parallel Programming. I’ll be honest, I didn’t really know what to expect going into this one; I just picked it because I figured that if I’m going to pay an eye-watering 800 Euros just to attend the Flux workshop, I might as well fill up the morning too.
It turned out to be really interesting. Not necessarily because of what I learnt about parallel programming, but more because of what I learnt about how people write code and think about HPC in completely different domains from the ones I am used to.
For non-regular readers, I have spent my HPC life working in financial services (see this for more on the differences). As a result, I mostly sat quietly and listened, learned and compared as the various presenters went over the challenges they see in debugging software (running on supercomputers) for performance problems.

I took the picture above during a presentation on tools that are useful but also cross-platform, the idea being that this minimises the pain of retooling just to port the code from, say, Intel to AMD.
I was quite surprised at how similar all the pain points were. It all felt somehow familiar even if it was being talked about in the context of weather prediction instead of financial risk models. Even the scale of funding for some of the HPC applications (admittedly USA-based) was similar to the costs of a financial risk system. That last point really was a surprise, as I’m so used to hearing about cash-poor research projects stringing code together to run on million-dollar supercomputers.
But there were some interesting differences. The first of these was non-technical and, to me at least, quite interesting: the differences in the power structures in play. It seems (and I may well be wrong here) that in the scientific community the researchers writing the HPC code have relatively little power. They write code that conforms to the requirements of a machine procured by someone much more senior, and they have little choice in the matter.
Quants writing financial models? Not so much. I’ve seen banks procure racks of GPU accelerated servers because a quant decided it would be much faster and cheaper to port the code to CUDA.
There were two statements that really struck me though. Not because of how differently something was being done, but because of how much they highlighted what wasn’t being done. Something that in my world is almost unthinkable at this point.
It also drove home to me why I got so little engagement on previous articles about this topic. Quite simply because it appears not to even exist for much of the HPC community.
I am of course talking about observability.
The first was a comment around the difficulty of profiling multi-threaded code and the variability in performance it introduces. Did you fix the bottleneck, or was this run just different? The speaker commented that this was still an unsolved problem for them, which was met with general consensus within the room.
This is not just a problem in financial risk systems, it is our everyday reality. Almost every risk system I see these days has at least some parts that are multi-threaded (even on the compute nodes). We have been dealing with this problem for over 20 years. I doubt a single person in FSI would think of it as unsolved though.
The second was a statement around how they will sometimes come across problems in code due to forgotten changes made for debugging, such as a barrier left in place. Sure, this happens in FSI too. It doesn’t happen for very long though!
You see, one of the big differences is that within financial services, any computationally expensive code is instrumented to within an inch of its life. That data is collected as telemetry with varying levels of sophistication. In the most rudimentary of cases a developer will look at it now and again and optimise something. In the most sophisticated, it forms part of a feedback loop that changes the parameters for every execution of the code.
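To make that concrete, here is a minimal sketch of what the rudimentary end of that spectrum might look like in Python. The `instrumented` context manager, the JSON-lines output file and the section name are all illustrative choices of mine, not any particular team’s or vendor’s API.

```python
import json
import os
import time
import uuid
from contextlib import contextmanager

# One id per process/run so records can be grouped and compared across runs later.
RUN_ID = str(uuid.uuid4())

@contextmanager
def instrumented(section, telemetry_path="telemetry.jsonl"):
    """Time a named section of code and append a telemetry record as a JSON line."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        record = {
            "run_id": RUN_ID,
            "pid": os.getpid(),
            "section": section,
            "elapsed_s": elapsed,
            "timestamp": time.time(),
        }
        with open(telemetry_path, "a") as f:
            f.write(json.dumps(record) + "\n")

# Usage: wrap the expensive parts of the model ("curve_bootstrap" is a made-up name).
with instrumented("curve_bootstrap"):
    ...  # the expensive computation being measured would go here
```

In a real system the record would usually go to a metrics backend rather than a local file, but the principle is the same: every expensive section, every execution, leaves a trace.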
The first problem is solved simply because you never look at the performance data for a single execution. Looking at aggregate values across multiple runs solves that one easily.
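Here is a sketch of the aggregation side, assuming records shaped like the ones above. Looking at the median and p95 across many runs, rather than a single profile, is what takes the guesswork out of “did I fix it or was this run just different?”

```python
import json
import statistics
from collections import defaultdict

def summarise(telemetry_path="telemetry.jsonl"):
    """Aggregate per-section timings across all recorded runs."""
    timings = defaultdict(list)
    with open(telemetry_path) as f:
        for line in f:
            record = json.loads(line)
            timings[record["section"]].append(record["elapsed_s"])

    for section, values in sorted(timings.items()):
        values.sort()
        p95 = values[int(0.95 * (len(values) - 1))]
        print(f"{section}: n={len(values)} "
              f"median={statistics.median(values):.3f}s "
              f"p95={p95:.3f}s")

if __name__ == "__main__":
    summarise()
```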
The second problem is resolved through integration of the performance telemetry within your CI pipeline.
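And a sketch of how that telemetry might gate a CI pipeline: compare the aggregate timings for the candidate build against a stored baseline and fail the job if any section has regressed beyond a tolerance. The baseline file, the command-line arguments and the 20% threshold are illustrative assumptions, not a prescription.

```python
import json
import statistics
import sys
from collections import defaultdict

TOLERANCE = 1.20  # fail if a section is more than 20% slower than baseline (illustrative)

def median_by_section(path):
    """Median elapsed time per section across all records in a telemetry file."""
    timings = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            timings[record["section"]].append(record["elapsed_s"])
    return {section: statistics.median(values) for section, values in timings.items()}

def main(baseline_path, candidate_path):
    baseline = median_by_section(baseline_path)
    candidate = median_by_section(candidate_path)
    failed = False
    for section, base in baseline.items():
        current = candidate.get(section)
        if current is None:
            continue  # section not exercised in this build
        if current > base * TOLERANCE:
            print(f"REGRESSION {section}: {base:.3f}s -> {current:.3f}s")
            failed = True
    sys.exit(1 if failed else 0)  # non-zero exit fails the CI job

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

A check like this is exactly what catches the forgotten barrier or the accidental 100x slowdown before it ever reaches production.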
I can already hear some of you cringing with pain as you imagine the extra overheads and costs of all this telemetry. It’s worth it. All day long.
In addition to the above, it transforms your painful parallel debugging sessions into a few simple queries, or sometimes even just a quick look at a dashboard! It means you can make more intelligent scheduling and optimisation decisions and ensure those evolve over time as your code drifts. It means you catch that forgotten barrier left in place for debugging, or that change that simply made the code 100x slower by mistake.
It’s time to instrument our code.