Misadventures in HPC Cloud Migration #13
VM Boot Times
It’s 5pm. It’s been a rough day in the financial markets. Your least favourite trader has just closed his books and kicked off his PnL. Three minutes later the support phone rings asking why on earth he’s still waiting. Sound familiar?
He’s waiting because you migrated to cloud and are using dynamic scaling to save costs. It’s 5pm. There’s no capacity online because there was, till now, no demand. He’s waiting for the VMs to start up.
I’ve previously shared a graph that shows that even on the slowest cloud platform it is possible to commission an entire new environment and start up a VM in under 2 minutes. So why is your trader still waiting?
Because you’re using a custom machine image. That takes about a week to boot.
Moving from an environment with always-on capacity to a dynamically scaled one also means you boot your machines (even create them) much more frequently. Your on-premises machine image that registers with five different inventory management systems, performs six CVE scans, asks for authorisation to boot and join the network from three different services, and then proceeds to start fifteen monitoring agents before starting any user processes needs to die. No, really, it does.
Not only will your least favourite trader from above be happy, you’ll lower your cloud bills too. Yep, you pay for those long boot times.
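To put a rough number on that, here is a minimal back-of-envelope sketch. It assumes you are billed for a VM from the moment it starts, boot time included. The function name `boot_overhead_cost` and every figure below (boot times, boots per day, hourly rate) are made up for illustration; plug in your own.

```python
def boot_overhead_cost(boot_seconds: float, boots_per_day: int, hourly_rate: float) -> float:
    """Daily spend on VM time consumed by booting, before any user work runs.

    Assumes billing starts when the VM starts, so boot time is paid for.
    """
    if boot_seconds < 0 or boots_per_day < 0 or hourly_rate < 0:
        raise ValueError("inputs must be non-negative")
    boot_hours = boot_seconds / 3600.0
    return boot_hours * boots_per_day * hourly_rate


# Hypothetical inputs, not measurements: a slow custom image versus a
# marketplace image, each booted 1000 times a day at $0.50 per VM-hour.
custom_image = boot_overhead_cost(boot_seconds=600, boots_per_day=1000, hourly_rate=0.50)
marketplace_image = boot_overhead_cost(boot_seconds=60, boots_per_day=1000, hourly_rate=0.50)

print(f"custom image:      ${custom_image:.2f} per day spent booting")
print(f"marketplace image: ${marketplace_image:.2f} per day spent booting")
```

The model is linear, so a boot that takes ten times longer costs ten times as much in overhead, and the total grows with every scale-up event.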
Dynamically scaled VMs used for HPC that live for hours before being shut down really do not need to be tracked and managed the same way as other long-lived cloud VMs. The cloud provider can give you a list of them any time you like, so you don’t need to register them in your inventory tracking system. They don’t need CVE scans either. Just enforce a policy on the machine image to ensure there are no known vulnerabilities. The machine will be gone in a few hours.
HPC worker machines shouldn’t be treated like the rest of your cloud estate. Do your bottom line a favour and figure out how to use a standard marketplace image.