AWS PCS Running Costs

A slightly tongue-in-cheek look at what it would cost to run AWS’ new PCS

I thought I’d take a slightly closer but cheeky look at AWS’ new Parallel Computing Service (PCS). I was curious as to how it compares against using a commercial HPC scheduler and/or paying someone to manage your HPC estate.

Let’s pretend for a minute we ran something at the scale of CERN on it. That’s half a million CPUs. You can’t actually get to that number with hpc7a instances, because the controller in its largest size manages a maximum of 2048 instances. We could use U-6tb but it would probably be cheaper to build another particle collider, so let’s just settle for 2048 hpc7a instances, giving us almost 400,000 vCPUs.
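For anyone who wants to check the back-of-the-envelope sums, here is a minimal sketch of the vCPU arithmetic. It assumes the largest hpc7a size (hpc7a.96xlarge) at 192 vCPUs per instance, which is my assumption rather than something stated above; the 2048-instance cap and the half-million target come from the paragraph.

```python
import math

# Assumption: hpc7a.96xlarge, the largest hpc7a size, exposes 192 vCPUs.
VCPUS_PER_INSTANCE = 192
# From the post: the largest PCS controller size tops out at 2048 instances.
MAX_INSTANCES = 2048
# From the post: CERN-scale is roughly half a million CPUs.
TARGET_CPUS = 500_000

total_vcpus = MAX_INSTANCES * VCPUS_PER_INSTANCE
shortfall = TARGET_CPUS - total_vcpus
# How many instances it would take to actually reach the target.
instances_needed = math.ceil(TARGET_CPUS / VCPUS_PER_INSTANCE)

print(f"vCPUs at the instance cap: {total_vcpus:,}")
print(f"Short of the target by:    {shortfall:,}")
print(f"Instances needed:          {instances_needed:,}")
```

So the cap leaves us a little over 100,000 vCPUs short, which is where the "almost 400,000" comes from.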

First up, the controller cost. A whisker under $5,000 a month. That’s not bad; you’d probably be hard pushed to run the hardware for Slurm on AWS for much less. So far so good.

Next, the node management fee. That’s a hefty $123,000 (just under, really) a month 😮, all to keep your 2048 VM instances patched and up to date. Seems expensive… but I guess if you’re paying $11 million a month for the EC2 instances maybe that’s just a rounding error? Even so, it’s probably cheaper to pay someone to update them manually as their full-time job.
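To put the "rounding error" claim in numbers, here is a rough sketch that works backwards from the monthly figures quoted above. It assumes a 730-hour month (365 × 24 / 12) and always-on instances; the per-hour rate it prints is implied by my rounded numbers, not a quoted AWS price.

```python
# Assumption: an average month is 365 * 24 / 12 = 730 hours, running 24x7.
HOURS_PER_MONTH = 730
INSTANCES = 2048

# Rounded monthly figures from the post (USD).
controller_monthly = 5_000
node_mgmt_monthly = 123_000
ec2_monthly = 11_000_000

instance_hours = INSTANCES * HOURS_PER_MONTH

# Implied node management fee per instance-hour (derived, not an official rate).
node_fee_per_instance_hour = node_mgmt_monthly / instance_hours
# Node management fee per instance per month.
node_fee_per_instance_month = node_mgmt_monthly / INSTANCES
# Node management as a fraction of the EC2 bill.
mgmt_share_of_ec2 = node_mgmt_monthly / ec2_monthly

print(f"Instance-hours per month:     {instance_hours:,}")
print(f"Implied fee per instance-hour: ${node_fee_per_instance_hour:.4f}")
print(f"Fee per instance per month:    ${node_fee_per_instance_month:.2f}")
print(f"Share of the EC2 bill:         {mgmt_share_of_ec2:.2%}")
```

On these figures the management fee is a bit over 1% of the EC2 spend, or roughly $60 per instance per month.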

That’s not a particularly realistic example though. Let’s pick something that is: a systemically important bank that runs on Slurm 😆 Yeah, you got me, maybe CERN running their compute grid on AWS is more likely.

Regardless, let’s say we get away with a medium-sized controller running 512 hpc7a instances. We don’t need to be running 24x7 either. An average of 12 hours per day for 22 working days per month is closer to reality. The controller cost is laughably low at below $900/month. You definitely wouldn’t run the HPC scheduler hardware for less than that a month, let alone pay for the HPC scheduler license. Bargain.

Even the node management costs come in at a palatable $11,000/month. If you could genuinely reduce headcount on the HPC team that might be worth it. On the other hand, a few scripts and VM lifetimes limited to n days would solve the problem even more cheaply and leave your engineers in a job and able to fix other problems.
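The part-time sums for the bank scenario can be sketched the same way. The 512 instances, 12 hours a day, 22 days a month and the roughly $11,000 fee come from the text; the per-hour rate is backed out of that rounded figure, and the 730-hour month used for the 24x7 comparison is my assumption.

```python
def monthly_hours(hours_per_day: float, days_per_month: int) -> float:
    """Billable hours per month for a part-time cluster."""
    return hours_per_day * days_per_month


INSTANCES = 512
hours = monthly_hours(12, 22)        # from the post: 12 h/day, 22 working days
instance_hours = INSTANCES * hours

# Rounded monthly node management fee from the post (USD).
node_mgmt_monthly = 11_000
# Implied per-instance-hour rate (derived from the rounded figure, not a quoted price).
implied_rate = node_mgmt_monthly / instance_hours

# Assumption: a 24x7 month is 730 hours. What would always-on cost at that rate?
ALWAYS_ON_HOURS = 730
always_on_node_mgmt = implied_rate * INSTANCES * ALWAYS_ON_HOURS
utilisation = hours / ALWAYS_ON_HOURS

print(f"Hours per month:         {hours:.0f}")
print(f"Instance-hours:          {instance_hours:,.0f}")
print(f"Share of a 24x7 month:   {utilisation:.1%}")
print(f"24x7 node management:    ${always_on_node_mgmt:,.0f}")
```

Running only during working hours cuts the bill to a bit over a third of the always-on figure, which is most of why the number looks palatable.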

Of course, the real problem here is that it would be a ten-year, multimillion-dollar project to move from whatever scheduler the bank is really using to PCS/Slurm.