Can Your HPC Move to Cloud?

Looking at an HPC refresh and wondering if you can move to the cloud? We’ve got a few tips you can follow to work it out.

I think at this point, the previously commonly held belief that HPC on the cloud is not possible has well and truly been put six feet under by the cloud AI wallahs building supercomputers that make El Capitan look like your high school desktop.

This probably means you can make your HPC work in the cloud too. Probably.

Thing is, “probably” doesn’t cut it when you have to decide whether to write a cheque for a few new 1MW racks filled with NVIDIA goodness or sign up for a monthly subscription that looks like a small country’s Netflix bill.

Time to try and build a bit more certainty before crunching those TCO numbers or pulling out the sketch pad and talking architecture or scheduler choices.

Right, let’s get boring. The sad reality is that most of what determines whether you can migrate to cloud isn’t anything HPC-specific but rather mundane. Let’s get that out of the way in this episode.

Who’s who and what can they do? (IAM)

Almost any organisation large enough to have an HPC cluster will also have enough users and applications to have some form of identity management and access control solution, one that goes further than everyone just winging it with usernames and passwords. Whether that’s Active Directory or Zentyal (or anything in between), the first question you need to ask yourself, and probably your helpful cloud solution architect, is how it will integrate with the access control for your new cloud resources.

Can your applications, system accounts and real people authenticate to the cloud using the same credentials they already have? How will those credentials be kept in sync?

Top tip: go deep here. Whilst this seems simple, it can end up breaking for all sorts of fun reasons. My favourite so far has been a case where the HPC compute nodes always ran on spot capacity. This necessitated a different machine image which, for performance reasons, could not include all of the standard security software used on premises. The upshot was that the machine could no longer connect to the KDC on premises to obtain a Kerberos ticket. Any workload that relied on access to network resources was dead in the water.
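A cheap way to catch that sort of surprise early is a preflight check baked into the compute image. What follows is a minimal sketch, not a definitive implementation: it only tests whether a TCP connection to the KDC can be opened on port 88 (the standard Kerberos port). The function name `can_reach` and the host name in the usage note are made up for illustration, and a successful connection does not prove you will actually be issued a ticket.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This is a reachability check only. It says nothing about whether
    authentication against the service will succeed.
    """
    try:
        # create_connection resolves the name and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

On a freshly booted spot node you might call `can_reach("kdc.example.internal", 88)` (a hypothetical host name) and refuse to join the scheduler if it returns False.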

Networks be crazy

The standout feature of the cloud is that it’s not on your premises, so you don’t have to power it or repair it. The problem with cloud is that it’s not on your premises so you need to figure out how to connect to it.

We’re not talking about giving BT or AT&T a call and asking for some ADSL either. Especially for data-heavy HPC workloads or large numbers of users, you’re going to need something heavy duty. Even if you were to go all in on cloud and run all your HPC there, you will almost certainly still have data on premises that it needs access to. Whether that’s input data from other systems or copious quantities of research sitting in a database, it will need to wing its way over to your cloud compute.

There are a few gotchas here again. Not only do you need to check what kind of connectivity is available between your site and your cloud region(s) of choice, but also things such as the bandwidth of the network equipment in the path (firewalls, for example), both on premises and in the cloud.
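To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the slowest hop in the path sets the ceiling and that you only ever see a fraction of the nominal rate. The function names and the default `efficiency` of 0.7 are assumptions for illustration, not measurements.

```python
def bottleneck_gbps(hop_capacities_gbps):
    """The end-to-end rate is capped by the slowest hop in the path."""
    return min(hop_capacities_gbps)


def transfer_hours(data_tb, link_gbps, efficiency=0.7):
    """Estimate hours needed to move data_tb terabytes over a link.

    data_tb    -- data volume in decimal terabytes (1 TB = 10**12 bytes)
    link_gbps  -- nominal link rate in gigabits per second
    efficiency -- assumed fraction of the nominal rate actually achieved
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 8e12                     # terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # gigabits/s -> usable bits/s
    return bits / usable_bps / 3600            # seconds -> hours
```

Feed `bottleneck_gbps` the capacity of every hop (circuit, on-premises firewall, cloud gateway) and pass the result to `transfer_hours`. If the answer is measured in weeks, better to find out now.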

Functionally speaking the network topology both on premises and in your new cloud deployment (and hopefully you’ve moved past a single open network at this point) is key. Ensure you will have a route from every potential HPC user to your cloud VPC hosting the new compute, that you don’t have any overlapping private subnets and of course that it complies with your security architecture.
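The overlapping subnet check in particular is easy to automate. A minimal sketch using Python’s standard `ipaddress` module follows. The subnet names and CIDR ranges you feed it are your own; the ones in the usage note are purely illustrative.

```python
import ipaddress
from itertools import combinations


def find_overlaps(subnets):
    """Return a list of (name_a, name_b) pairs whose CIDR ranges overlap.

    subnets -- dict mapping a label to a CIDR string, e.g. "10.0.0.0/16"
    """
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    overlaps = []
    for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2):
        # An IPv4 range can never overlap an IPv6 range, so skip mixed pairs.
        if net_a.version != net_b.version:
            continue
        if net_a.overlaps(net_b):
            overlaps.append((name_a, name_b))
    return overlaps
```

For example, `find_overlaps({"onprem-hpc": "10.0.0.0/16", "cloud-vpc": "10.0.128.0/20"})` reports the pair as overlapping, because the second range sits inside the first.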

Last of all, don’t forget the fun that is DNS. It’s quite likely you will have multiple DNS zones across cloud and on-prem, and resolving cloud host names from on premises and vice versa isn’t magic that will just work. This can be particularly problematic if you haven’t carefully managed your (sub)domains for internal use and have workloads that need to span cloud and on-prem without using the same configurations and hostnames. The best planned cloud adoption can quickly turn into a BOFH’s playground of managing hosts file entries and IP addresses. Especially if they are IPv6 addresses.
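One way to find out before your users do is to run the same resolution check from both sides of the link and compare the answers. The sketch below assumes the machine’s configured resolver is the one you care about. `resolve` and `check_names` are illustrative names, and any host names you pass in are yours to supply.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated addresses the local resolver gives."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def check_names(hostnames, resolver=resolve):
    """Map each hostname to its addresses, or None if it does not resolve."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            results[name] = None
    return results
```

Run it with the same list of names on an on-premises host and on a cloud instance. Any name that comes back as None on one side, or with different addresses on each, is a conversation to have with whoever owns the zones.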

Cupboard Space & Locks

You’ll need a good idea of how much storage you have on premises, in all its different flavours. That means not only the technologies (file systems, databases, NoSQL, in-memory caches) but also the different performance tiers and security requirements. Then you get to map how those translate to their cloudy equivalents and how that data will get there.

And what it will cost to keep it there.
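Even a crude inventory makes that conversation easier. A minimal sketch, assuming you can boil each tier down to a size and a price per terabyte per month. The `Tier` structure is hypothetical, and any prices you plug in must come from your own provider’s price list; none are supplied here.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str                  # e.g. "scratch" or "archive"
    size_tb: float             # capacity to be kept in this tier, in TB
    price_per_tb_month: float  # taken from your provider's price list


def monthly_cost(tiers):
    """Sum the monthly storage cost across all tiers.

    Capacity cost only: data transfer and request charges are ignored.
    """
    return sum(t.size_tb * t.price_per_tb_month for t in tiers)
```

It deliberately ignores transfer and request charges, so treat the result as a floor rather than a forecast.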

The Gatekeepers

If there’s one thing that’s more critical than all of the above though, it’s making sure you have support from the right people at the right levels of power.

Nothing will stall your cloud adoption faster than an uncooperative CTO (or CIO or head of research computing or equivalent in your organisation).

That’s not your only worry though. Security reviews, intransigent application developers, high profile researchers with their own ideas or quants with a bigger budget than yours can just as quickly derail things no matter how technically or economically proficient your plan might be.

Dealing with these challenges has to be as much a part of your plan as figuring out how you’ll direct workload between the cloud and on premises. Don’t forget, it’s a different skill set too. Make sure your team includes it.

Passed that little test? Now we can start thinking about the actual fun stuff. And what it will cost.

This article is part of a long-running series on cloud migration:

Misadventures In HPC Cloud Migration - Flux by HMx Labs
Tips for Migrating HPC to Cloud