The Bi-Modal Nature of Benchmarks

Benchmarks can serve two subtly different purposes. Are you sure you’re getting what you need from yours?

Is your benchmark telling you what you think it is?

We all know that a benchmark is a bit of code that allows us to compare the relative performance of machines. Is my MacBook faster than your water-cooled, home-built gaming rig? Not all benchmarks are created equal in this regard, though.

For a moment, let’s put aside how representative the benchmark is of your real workload and pretend there’s a perfect correlation. Yeah, I know, you might have to pretend really hard. Do your benchmark results show the difference in performance you can expect between different hardware for your workload? Or do they tell you how well the benchmark can be optimised for a particular type of hardware?

Once upon a time, those two things may have been the same. You bought your supercomputer and knew that all your code would run on a few thousand copies of the same CPU. You optimised your code for that CPU model, and all was right in the world.

Then you bought a new supercomputer. Your code ran even faster. Happy days. But was it running as fast as it possibly could on that new hardware, with its bigger CPU caches, more bandwidth and faster interconnects? Probably not. Did anyone optimise it all again for the new machine? Maybe.

Then you moved to the cloud. At least sometimes. Which CPU was your code optimised for now?

Intel and AMD both have their own versions of HPL (High-Performance Linpack). That’s great if you’re trying to compare the best possible performance each CPU could provide. It’s ideal if you will install a fixed compute base and optimise all your software for that CPU.

What if your code was written twenty years ago, had a messy divorce with the profiler and they haven’t seen each other in ten years? What if, maybe to minimise compute costs or to get the largest possible amount of available spot compute capacity in the cloud, your code can go from running on Intel CPUs one day to Graviton the next? Which one are you optimising for?

Maybe you have enough time and money to maintain multiple implementations, each one tailored to a CPU you can use. You probably don’t though.

Suddenly you don’t want your benchmark to give you the hero numbers that are possible for each CPU, but the numbers you’ll see if you just throw your crufty old code at it, as is.

This was the backdrop that COREx was written against. I’ve had people offer to optimise it to run faster on a given hardware platform, but to me that would go against its purpose.

Your best benchmark will always be your own code, and it’s easier than you think to create a benchmark out of it.
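To give a rough idea of how little is needed, here is a minimal sketch in Python. It is not how COREx works; it simply times an unmodified piece of code a few times and reports the median and best run. The names run_benchmark and example_workload are made up for illustration, and example_workload is only a stand-in for a call into your own code.

```python
import platform
import statistics
import time


def run_benchmark(workload, repeats=5):
    """Time an unmodified workload several times and summarise the runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()  # run the code exactly as it is, no per-CPU tuning
        timings.append(time.perf_counter() - start)
    return {
        "machine": platform.machine(),  # e.g. the architecture string reported by the OS
        "repeats": repeats,
        "median_s": statistics.median(timings),
        "best_s": min(timings),
    }


def example_workload():
    # Hypothetical stand-in: replace this with a call into your real code.
    total = 0.0
    for i in range(200_000):
        total += (i % 7) * 0.5
    return total


if __name__ == "__main__":
    print(run_benchmark(example_workload))
```

Run the same script, unchanged, on each machine or instance type you are considering and compare the medians. The point is that nothing in it is tuned for any one CPU, so the numbers reflect what your code will actually do there.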