HPC Cloud Updates WE 24 Aug 2025

Updates to AWS, Azure & GCP in the last week relevant for HPC practitioners. New Intel VM family on AWS and how to check your GPU VMs actually work before doing any real work with them.

AWS

If you’re running HPC jobs on AWS Batch, this change to how default instance types are selected may be handy.

Introducing “default” instance categories for AWS Batch | Amazon Web Services
Today, we are launching a new set of instance family categories for AWS Batch, “default_x86_64” and “default_arm64”. These new categories represent both a clarification and an improvement upon the existing “optimal” instance type category. This blog post gives some background on the new feature and how you can configure your Batch environments to take advantage […]
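
As a rough illustration of what opting in to the new categories might look like, here is a minimal sketch using boto3. Only the category names "default_x86_64" and "default_arm64" come from the AWS post; the helper names, subnet, security group, instance role and vCPU limits are hypothetical placeholders, and this is an assumption about a typical setup rather than the configuration AWS recommends.

```python
def default_category_compute_resources(arch="x86_64", max_vcpus=256):
    """Build the computeResources section for a managed EC2 Batch environment.

    The instance type is one of the new "default" categories named in the
    AWS post rather than the older "optimal" category.
    """
    categories = {"x86_64": "default_x86_64", "arm64": "default_arm64"}
    if arch not in categories:
        raise ValueError(f"unsupported architecture: {arch}")
    return {
        "type": "EC2",
        "allocationStrategy": "BEST_FIT_PROGRESSIVE",
        "minvCpus": 0,
        "maxvCpus": max_vcpus,  # hypothetical limit, size to your workload
        "instanceTypes": [categories[arch]],
        # Hypothetical placeholders: replace with values from your account.
        "subnets": ["subnet-EXAMPLE"],
        "securityGroupIds": ["sg-EXAMPLE"],
        "instanceRole": "ecsInstanceRole",
    }


def create_environment(name, arch="x86_64"):
    """Create the compute environment (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    batch = boto3.client("batch")
    return batch.create_compute_environment(
        computeEnvironmentName=name,
        type="MANAGED",
        state="ENABLED",
        computeResources=default_category_compute_resources(arch),
    )
```

Calling `create_environment("hpc-arm", arch="arm64")` would request an environment that draws from the arm64 default category; the dictionary builder is kept separate so the configuration can be inspected without touching AWS.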

We finally have some new Intel-based VMs from AWS. R8i and R8i-flex are now generally available.

Best performance and fastest memory with the new Amazon EC2 R8i and R8i-flex instances | Amazon Web Services
R8i and R8i-flex instances powered by Intel Xeon processors offer up to 15% better price performance, 20% higher performance, and 2.5 times more memory throughput compared to previous generations.

New instances: r7g in Cape Town; i7i in Frankfurt, London, Malaysia, Sydney and Tokyo


Azure

The post linked below is about running AI workloads, but the idea should translate to pretty much any kind of GPU workload: either swap in a more appropriate benchmark, or assume that the Llama workload will surface the same problems your own workload would hit.

What I find rather telling, though, is the tacit recognition of the level of GPU failures being seen. Whilst this is no doubt useful for end users, shouldn’t companies renting out GPUs be making sure they’re fit for purpose themselves before charging people for them? This isn’t a slight on Azure in particular; it applies to everyone renting out GPUs in the cloud.

Optimizing Large-Scale AI Performance with Pretraining Validation on a Single Azure ND GB200 v6
Small performance gaps on a single virtual machine lead to large and costly performance losses at scale. Running small-scale pretraining jobs enables…
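
The general idea of screening VMs before committing them to a large job can be sketched in a few lines. This is not the method from the Azure post: it assumes you have already run some short benchmark on each VM and collected one throughput figure per VM, and the function name, VM names, numbers and 5% tolerance are all hypothetical.

```python
from statistics import median


def flag_slow_vms(throughputs, tolerance=0.05):
    """Return the names of VMs running noticeably slower than the fleet.

    `throughputs` maps a VM name to a benchmark result where higher is
    better (for example tokens per second). A VM is flagged when it falls
    more than `tolerance` below the median of all results.
    """
    if not throughputs:
        return []
    baseline = median(throughputs.values())
    threshold = baseline * (1.0 - tolerance)
    return sorted(name for name, value in throughputs.items() if value < threshold)


# Hypothetical benchmark results from four VMs.
measured = {"vm-0": 1010.0, "vm-1": 995.0, "vm-2": 870.0, "vm-3": 1002.0}
print(flag_slow_vms(measured))  # prints ['vm-2']
```

Comparing against the fleet median rather than a vendor figure means no reference number is needed, at the cost of missing a problem that affects most of the VMs equally.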

Azure NetApp Files now has access logs

Azure updates | Microsoft Azure
Subscribe to Microsoft Azure today for service updates, all in one place. Check out the new Cloud Platform roadmap to see our latest product plans.

v2 billing model for Azure Files SSD premium now GA

Need confidential VMs? Private preview of DCesv6 and ECesv6 now available

Google Cloud

Not HPC, but still interesting from a large-scale compute cost perspective.

Measuring the environmental impact of AI inference | Google Cloud Blog
A methodology for measuring the energy, emissions, and water impact of Gemini prompts shines a light on the environmental impact of AI inference.