Observability in HPC

If you can’t see what’s going on, what hope do you have of controlling it?

Power is nothing without control – Pirelli. I think it applies to HPC too. And it’s hard to control something if you can’t see it.

But first, thanks to everyone who commented and suggested additional products and tools to include in the HPC catalogue. I have collated all the responses and made the list available here:

GitHub - hmxlabs/hpc-catalog: A community driven catalog of tools and products that are useful in the world of high performance computing (HPC)

I have tried to categorise things into various sections in the hope that this is helpful, but it is quite possible I have gotten some of these a little wrong. Feel free to raise a PR or comment here with any corrections or additions to the list.

Back to control: Aside from the fact that there are a LOT more HPC products out there than I was aware of, the other thing that struck me is the lack of solutions in the observability space. In fact, it’s an area that seems to be positively overlooked in comparison to the number of job scheduling options. 

Not only that, but even in the HPC installations I’ve come across, it is often an afterthought. Tacked on at the end. I’ve seen many solutions, ranging from Splunk or Elastic products to fully home-grown ones, and they vary greatly in quality. I have yet to see anything approaching an out-of-the-box solution in this space, though.

So, aside from spending the GDP of a small country on Splunk, has anyone got any suggestions for good HPC observability solutions? Especially ones that will work across on-prem and cloud (and play nicely with the egress troll)?