HPC Cloud Updates WE 25 May 2025

Updates to AWS, Azure & GCP in the last week relevant for HPC practitioners. There are only two hard things in Computer Science: cache invalidation and naming things… and Microsoft proves it’s still terrible at one and leaves the other to you.

AWS

This new feature could be handy for swapping out grid nodes without too much interruption to the workload.

Announcing customer-initiated reboot migrations for EC2 Scheduled Events - AWS

New instances: High memory instances (U-1) now available in Ohio


Azure

Want to know how to use Slurm + CycleCloud with a custom image? Here you go:

Slurm custom image for a locked down environment and faster start-up time, Azure Cyclecloud | Microsoft Community Hub
Environment : Cyclecloud: 8.7.1 Slurm project 3.0.11 Slurm version: 23.11.10-2 OS of compute and execute: marketplace Almalinux HPC image gen 2…

GPU failure rates must be getting really bad 😁

Building an Automated Recovery Pipeline for GPU Clusters with Slurm on Azure Part1 | Microsoft Community Hub
Disclaimer: The `slurm-cluster-health-manager` project is a sample tool created specifically for the article it accompanies. It was developed by the author…

Azure Managed Redis (an alternative to Azure Cache for Redis) is now GA. Wow, Microsoft really sucks at naming. I guess it just proves how right Phil Karlton was when he said “There are only two hard things in Computer Science: cache invalidation and naming things.” Oh, and the cache invalidation is down to you 😉

Azure updates | Microsoft Azure

GitHub Copilot for Azure is also GA.

Azure updates | Microsoft Azure

Google Cloud

Well, we had Google I/O. I won’t even attempt to cover everything, and I’m sure you’ve seen it all in other places anyway. Unless I missed it, there doesn’t seem to have been much about HPC or AI infrastructure, though.