Spot Capacity, Schedulers and Gen AI
Weaving together three unlikely aspects of the cloud
Spot capacity is great. You get cheap compute; the cloud vendor gets to use some spare capacity and make a bit more money. Everyone is happy.
Sometimes, and more often than you’d think, you can even get away with pretending spot capacity can’t be reclaimed at any point and still run your long-running workloads. I’m not recommending it, but if you can get away with it, sure, why not! Checkpointing or breaking work down into smaller tasks can be a lot of effort, so the temptation to skip it is strong.
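For what it’s worth, the checkpointing half of that effort is often smaller than it looks. Here’s a minimal sketch of the pattern in Python; the checkpoint file, cadence, and the “work” inside the loop are all placeholders:

```python
import os
import pickle

# In real life this should live on durable storage that survives the VM
# (a mounted volume or object store), not the instance's local disk.
CHECKPOINT = "job.ckpt"

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "accumulator": 0}

def save_state(state):
    # Write-then-rename so an eviction mid-write can't corrupt the checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

state = load_state()
while state["step"] < 1_000_000:
    state["accumulator"] += state["step"]  # stand-in for the real work
    state["step"] += 1
    if state["step"] % 10_000 == 0:  # cadence is a cost vs. lost-work trade-off
        save_state(state)
```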
Ultimately, though, if a scrappy AI startup comes along and eats all your GPUs, or power problems mean the hyperscaler decides to turn off half the region you operate in, and your spot capacity suddenly becomes a lot sketchier than it used to be, you’re on your own. That was the deal. If they want the capacity back, they take it and you get evicted. Whenever they like. Even if you’re in the middle of a million-dollar calculation. They’ll play you the smallest violin in the world if you complain.
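You do at least get a heads-up. On AWS, for instance, a spot interruption notice shows up in instance metadata roughly two minutes before reclamation, and it’s a poll away. A sketch, assuming IMDSv1 is enabled (IMDSv2 additionally requires a session token header):

```python
import json
import time
import urllib.request
from urllib.error import HTTPError, URLError

# EC2 instance metadata endpoint for spot interruption notices.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_notice():
    # Returns the notice payload if eviction is scheduled, else None.
    # AWS serves a 404 here until an interruption is actually pending.
    try:
        with urllib.request.urlopen(NOTICE_URL, timeout=2) as resp:
            return json.load(resp)
    except (HTTPError, URLError):
        return None

while True:
    notice = interruption_notice()
    if notice:
        print(f"Eviction at {notice.get('time')}: checkpoint and drain now")
        break
    time.sleep(5)
```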
Not a problem: you move your workload to a different region, VM type, or maybe even vendor. It isn’t really that hard to do, after all, right? Here’s my question, though. Why, as a user of large-scale cloud compute, am I still having to think about this at all? Why isn’t this basic compute infrastructure that I just use?
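Because “move your workload” in practice means every team hand-rolls roughly the same shim. Something like this sketch with boto3, where the fallback list is hypothetical and the AMI ids are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical fallback order: (region, instance type, region-specific AMI id).
CANDIDATES = [
    ("us-east-1", "p4d.24xlarge", "ami-REPLACEME1"),
    ("us-west-2", "p4d.24xlarge", "ami-REPLACEME2"),
    ("eu-west-1", "g5.48xlarge",  "ami-REPLACEME3"),
]

def launch_somewhere():
    # Walk the fallback list until one spot request actually lands.
    for region, itype, ami in CANDIDATES:
        ec2 = boto3.client("ec2", region_name=region)
        try:
            resp = ec2.run_instances(
                ImageId=ami,
                InstanceType=itype,
                MinCount=1,
                MaxCount=1,
                InstanceMarketOptions={"MarketType": "spot"},
            )
            return region, resp["Instances"][0]["InstanceId"]
        except ClientError as e:
            # Typically InsufficientInstanceCapacity or similar; try the next one.
            print(f"{region}/{itype} failed: {e.response['Error']['Code']}")
    raise RuntimeError("no capacity anywhere in the fallback list")
```

One fallback list, three regions, and congratulations: I’m now maintaining a scheduler.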
Cloud infrastructure has made huge advances over the simple on-prem servers of yore. Kubernetes and serverless functions, to name but two. Yet I still can’t just ask for “compute” and let the system figure out where it should land (within some parameters I can define).
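To make that concrete, here is the request I wish I could hand over and then stop thinking. This is entirely imaginary; every field name is invented, because no such cross-cloud primitive exists today:

```python
from dataclasses import dataclass, field

# Purely hypothetical: what "just ask for compute" might look like.
@dataclass
class ComputeRequest:
    vcpus: int
    memory_gib: int
    gpus: int = 0
    max_hourly_usd: float = 10.0  # the one constraint I actually care about
    allowed_regions: list[str] = field(default_factory=lambda: ["any"])
    interruptible: bool = True    # fine, evict me; but *you* re-place the work
    checkpoint_uri: str = "s3://my-bucket/ckpt"  # placeholder durable store

job = ComputeRequest(vcpus=64, memory_gib=256, gpus=8)
# scheduler.submit(job)  # the part that doesn't exist yet
```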
I guess AWS, Azure and Google are too busy giving us Gen AI instead.