Metered Innovation
What impact does using a metered resource (such as the cloud) have on creativity and innovation?

There’s no question in my mind that you can do HPC on the cloud. But sometimes, having to use a metered resource (and the cloud is metered in so, so many different ways!) can stifle creativity and innovation a little.
I created the COREx benchmark (GitHub) at the tail end of 2022, so coming up on three years ago now. In that time, it has been run on many thousands of cloud VMs.
You’d think that by now I’d have a very good handle on its characteristics, right? I did too.
Then I got access to a 512 vCPU machine without having to pay for it by the minute. I ran our standard benchmark runs, which completed in around two hours. But I still had access to the machine. Now what?
I started playing. I ran COREx in configurations that I had never bothered to try in the past, because conventional wisdom says they would make no sense. Performance would be terrible. The problem is that conventional wisdom hasn’t caught up with CPUs that have as much L3 cache as you once had RAM. And as much RAM as you once had local storage.
As we all know, when it comes to performance analysis there is no substitute for actually doing the tests.
Financial risk systems today (which COREx emulates) usually evolved from running overnight risk into also running ad-hoc simulations and, all too often, real-time intraday risk. All from the same underlying quant analytics libraries and risk system. Sure, there are some minor differences in the configuration of the grid and the distribution or priority of workload, but the underlying code is the same. And optimised the same way. Especially in terms of things such as the use of SMT and process-to-vCPU ratios. (Yes, you can play with that as a factor.)
At the risk of stating the obvious, real-time risk needs to be optimised for latency whilst end-of-day risk is more concerned with throughput. Would you look at that: you need different configurations for each. Not only that, but to my surprise, you can achieve higher throughput by overloading the process-to-vCPU ratio a little. (Because COREx, like real risk systems, is not perfectly CPU bound 100% of the time, and the L3 cache is massive.)
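For the curious, the ratio sweep is easy to sketch. COREx’s actual harness isn’t shown here, so this is a minimal stand-in: `simulate_path` is a hypothetical CPU-bound kernel playing the role of one pricing task, and `run_with_oversubscription` measures throughput when the worker-process count is some multiple of the vCPU count.

```python
import multiprocessing as mp
import os
import time

def simulate_path(_):
    # Hypothetical stand-in for one risk/pricing task; any
    # CPU-bound kernel works for the shape of the experiment.
    total = 0.0
    for i in range(50_000):
        total += (i % 7) * 0.5
    return total

def run_with_oversubscription(ratio, tasks=200):
    """Run `tasks` work items using ratio * vCPU worker processes.

    Returns throughput in tasks per second, so different ratios
    can be compared directly.
    """
    workers = max(1, int(os.cpu_count() * ratio))
    start = time.perf_counter()
    with mp.Pool(processes=workers) as pool:
        pool.map(simulate_path, range(tasks))
    elapsed = time.perf_counter() - start
    return tasks / elapsed

if __name__ == "__main__":
    # Sweep around 1.0 process per vCPU. On workloads that are not
    # perfectly CPU bound, ratios slightly above 1.0 can win on
    # throughput; for latency-sensitive intraday runs you would
    # instead tune for per-task completion time.
    for ratio in (0.5, 1.0, 1.25, 1.5):
        print(f"ratio {ratio}: {run_with_oversubscription(ratio):.1f} tasks/s")
```

On metal you own, leaving a sweep like this running costs nothing; on a metered VM, every extra ratio is a line item.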
I guess it really shouldn’t be a surprise, but having never run these kinds of experiments before, it was a little. I’m also not aware of anyone who has run this kind of analysis and is running different configurations for intraday versus end-of-day risk.
I swear I will finish this white paper at some point, but I don’t think I would ever have even tested these ideas on a metered resource. That’s without even accounting for the increase in the number of test runs that would have been required to compensate for not having the exact same conditions for each run, as can be assured with physical access to the metal.
Which made me wonder: yes, sure, you can do HPC for research in the cloud, and there are cases where it’s also the best way to do it. But I don’t think it would be a good idea to lose access to an unmetered “free” resource. There’s no question in my mind that innovation suffers.