Too Many Schedulers, Not Enough Patience

Need to pick an HPC scheduler? Want to collaborate on a tool to help you?

How do you pick an HPC scheduler? Schedulers play a critical role in HPC and AI infrastructure, but they’re also an area where few people have experience with multiple products, let alone every possible product.

I’ve worked in HPC for close to twenty years, and when I crowdsourced the HPC Catalog last year, schedulers were added to the list that I wasn’t even aware existed.

It was precisely this conundrum that Alex Kimber invited me to have a chat about last week. Better than that, though: he had an idea to help.

What if you could start your scheduler selection process by narrowing down the list of products to the ones that can at least meet your requirements? Answer a few simple questions and eliminate products from the list based on their feature sets. Seems simple enough, right?

Alex even had an Excel sheet ready to go as a prototype to demonstrate the idea, and we have agreed to try to make it a reality. But with your help. Yes, you.

We have an initial list of questions (and therefore corresponding features), but we’d like your input to extend and refine this list. We will then take that, create an open source tool, and host it online, free to use.

If you’re a vendor of one of these products, I’d love your input on making sure we have the correct feature set. If we get it wrong, or you enhance the scheduler, a change is only a pull request away.

Here’s what Alex and I have so far, and we’re looking forward to your input to round this out.

  1. What is your typical task duration? [seconds to minutes | minutes to hours]
  2. What is your task submission rate? [ < 500 | 500 to 1000 | > 1000 per second ]
  3. Do you require Microsoft Windows support? [no | compute nodes only | compute nodes and scheduler hardware]
  4. Which CPU architectures do you need to support? [ x86_64 | arm64 | RISC-V ]
  5. Do you require GPU support? [ no | NVIDIA | AMD | NVIDIA & AMD ]
  6. Do you require support for other accelerators such as FPGAs? [ no | yes ]
  7. Do you require support for containers? [ no | yes ]
  8. Where will the scheduler and compute nodes run? [ on prem only | AWS | Azure | GCP | Other Cloud | Multi Cloud/Hybrid ]
  9. Do you require cloud orchestration as part of the scheduler? [ no | yes ]
  10. Do you require a fully managed cloud-based solution? [ no | yes ]
  11. Do you require data-aware workload scheduling? [ no | yes ]
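
To make the idea concrete, here is a minimal sketch of how the elimination step might work in Python. It is not the tool itself, which doesn’t exist yet. The scheduler names, the feature data, the question keys (such as "task_duration") and the shortlist function are all made up for illustration and describe no real product.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    name: str
    # Maps a question key to the set of answers this product can satisfy.
    features: dict


# Hypothetical catalog: names and feature data are invented for illustration.
CATALOG = [
    Scheduler("scheduler-a", {
        "task_duration": {"seconds to minutes", "minutes to hours"},
        "windows": {"no", "compute nodes only"},
        "gpu": {"no", "NVIDIA"},
        "containers": {"no", "yes"},
    }),
    Scheduler("scheduler-b", {
        "task_duration": {"minutes to hours"},
        "windows": {"no"},
        "gpu": {"no", "NVIDIA", "AMD", "NVIDIA & AMD"},
        "containers": {"no", "yes"},
    }),
    Scheduler("scheduler-c", {
        "task_duration": {"seconds to minutes", "minutes to hours"},
        "windows": {"no", "compute nodes only",
                    "compute nodes and scheduler hardware"},
        "gpu": {"no"},
        "containers": {"no"},
    }),
]


def shortlist(catalog, answers):
    """Return names of schedulers whose features cover every given answer.

    A scheduler with no data for a question is eliminated, on the
    assumption that an unknown feature should not count as supported.
    """
    return [
        s.name
        for s in catalog
        if all(answer in s.features.get(question, set())
               for question, answer in answers.items())
    ]


if __name__ == "__main__":
    my_answers = {"task_duration": "seconds to minutes", "gpu": "NVIDIA"}
    print(shortlist(CATALOG, my_answers))
```

Each answer you give can only shrink the list, and unanswered questions eliminate nothing. Keeping the feature data as plain records like these is one way a vendor correction could stay a small, reviewable pull request.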