Observability in HPC

Gaining insights into your HPC workloads, but how? Anything that works out of the box?

What observability tools are you using for HPC?

I’ve seen logs and telemetry data dumped to a file share (and you’d better know your way around grep, sed and awk if you want to make sense of it). I’ve also seen every possible metric captured in Splunk, tabled, graphed and charted. And everything in between.

The variation is enormous, but the difference in the capabilities each approach provides is even bigger. At one end, you have no idea how much of your (very expensive) compute estate is even used… At the other, you can work out that one rack in your data centre is missing blanking plates, so it isn’t being cooled properly, and the resulting drop in performance sticks out on your charts like a sore thumb (real example!).

What I haven’t really seen though is a product that works out of the box in HPC environments. To be clear, I don’t mean one that just collects metrics. There are several that can do that. I mean specifically for observability. A ton of metrics sat in ClickHouse, ELK or Splunk on their own are no good to anyone. You might as well light $100 bills on fire to keep yourself warm while you try to find patterns in the matrix.

Do turnkey solutions in this space not exist? Or have I just not seen them?

I thought maybe part of the problem is the large variation in possible implementations. Financial risk systems, for example, vary a lot, but there are common themes that exist in all of them. Surely there are commonalities across HPC applications that can be drawn out to provide observability platforms too? Perhaps built on top of, or as extensions to, popular metrics/telemetry platforms.

Why can’t I deploy something and have it show me (and I mean show, with pretty pictures in glorious technicolor) insights into my data? At the very least I would have thought the basics, like CPU hours sliced by application and hardware. Same for CPU and memory utilisation, and a bunch of other things I’m sure I could list.

All that, without any extra effort from the user. Without having to configure a million properties or write pages of Splunk searches. (Yes, I know I can do it in Splunk with a bit of work.)
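To show how small the "basics" really are, here is a rough sketch of CPU hours sliced by application and hardware. Everything in it is an assumption for illustration: the job records, the field names (app, hardware, cores, runtime_s) and the helper cpu_hours_by are made up, and a real setup would pull these from whatever your scheduler's accounting data looks like.

```python
from collections import defaultdict

# Hypothetical job records. Field names and values are illustrative only
# and do not come from any particular scheduler.
jobs = [
    {"app": "risk-engine", "hardware": "cpu-gen1", "cores": 32, "runtime_s": 7200},
    {"app": "risk-engine", "hardware": "cpu-gen2", "cores": 64, "runtime_s": 3600},
    {"app": "cfd-solver", "hardware": "cpu-gen1", "cores": 16, "runtime_s": 1800},
    {"app": "cfd-solver", "hardware": "cpu-gen1", "cores": 16, "runtime_s": 5400},
]


def cpu_hours_by(records, *keys):
    """Sum CPU hours (cores * runtime in hours) grouped by the given fields."""
    totals = defaultdict(float)
    for job in records:
        group = tuple(job[k] for k in keys)
        totals[group] += job["cores"] * job["runtime_s"] / 3600.0
    return dict(totals)


by_app_and_hw = cpu_hours_by(jobs, "app", "hardware")
by_app = cpu_hours_by(jobs, "app")

for (app, hw), hours in sorted(by_app_and_hw.items()):
    print(f"{app:12s} {hw:10s} {hours:8.1f} CPU hours")
```

The arithmetic is trivial. The hard part, and the part I'd expect a turnkey product to handle, is getting clean job records with consistent application and hardware labels in the first place.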

Is it genuinely too hard? Is there no money to be made doing it? Is HPC too much of a dinosaur for anyone to care enough to build it?