SLURM vs Kubernetes vs Nomad

  • Features: SLURM focuses primarily on scheduling, whereas Kubernetes combines a scheduler (kube-scheduler), distributed configuration (etcd), and control loops / negative-feedback loops (kube-controller-manager; à la Norbert Wiener and cybernetics), plus custom logic to deal with the fact that its unit of scheduling is a grouping of containers (a pod) instead of a group of processes (i.e., containers need networking, DNS, etc. configured dynamically). There is, however, no reason in principle why SLURM could not have some of these features as well (e.g., confd could be combined with SLURM and etcd to give SLURM a distributed configuration that is updated dynamically). A minimal control-loop sketch follows this list.
  • Unit of Scheduling: SLURM schedules jobs (typically one or more processes), while Kubernetes schedules pods (groups of containers). Jobs can have resource isolation via cgroups, but they have no namespace isolation and no special handling for filesystems (i.e., union filesystems); in effect, jobs have one third of what makes up a container. If namespace isolation and union filesystems aren't important, SLURM has a compelling advantage over Kubernetes because it carries less complexity and overhead. A cgroups-only sketch follows this list.
  • Scheduling Algorithms: SLURM uses a priority queue or a backfill algorithm. Kubernetes filters the available nodes, assigns each a priority score, and then pops off the node with the highest score (so most likely a priority queue as well). HashiCorp Nomad uses a bin-packing algorithm. A filter-and-score sketch follows this list.
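
As promised above, here is a minimal reconciliation (negative-feedback) loop in Python, in the spirit of kube-controller-manager: observe the current state, compare it to the desired state, and correct the difference. Everything here is hypothetical scaffolding -- get_desired_state, get_observed_state, and converge stand in for reads from a store like etcd and for the actual corrective action.

    import time

    def get_desired_state():
        # Hypothetical: in Kubernetes this would come from etcd via the API server.
        return {"web": 3, "db": 1}   # service name -> desired replica count

    def get_observed_state():
        # Hypothetical: in Kubernetes, kubelets report actual state back up.
        return {"web": 2, "db": 1}

    def converge(name, observed, desired):
        # Hypothetical corrective action, e.g., starting or stopping replicas.
        print(f"{name}: observed={observed}, desired={desired} -> correcting")

    def reconcile_forever(interval_s=5):
        # The control loop proper: measure, compare, correct, repeat.
        while True:
            desired = get_desired_state()
            observed = get_observed_state()
            for name, want in desired.items():
                have = observed.get(name, 0)
                if have != want:
                    converge(name, have, want)
            time.sleep(interval_s)

    if __name__ == "__main__":
        reconcile_forever()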
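For the unit-of-scheduling point, this sketch shows the cgroups slice of containerization on its own: it caps a child process's memory via the cgroup v2 filesystem, much like SLURM's cgroup-based job containment, with no namespaces and no union filesystem involved. It assumes root and a cgroup v2 mount at /sys/fs/cgroup; the cgroup name and limit are made up.

    import subprocess
    from pathlib import Path

    CGROUP = Path("/sys/fs/cgroup/demo-job")  # hypothetical cgroup name

    def run_with_memory_cap(argv, limit_bytes=256 * 1024 * 1024):
        CGROUP.mkdir(exist_ok=True)
        (CGROUP / "memory.max").write_text(str(limit_bytes))
        proc = subprocess.Popen(argv)
        # Moving the PID into the cgroup makes the kernel enforce the cap.
        (CGROUP / "cgroup.procs").write_text(str(proc.pid))
        return proc.wait()

    if __name__ == "__main__":
        # The child still shares the host's PID, network, and mount
        # namespaces and sees the host filesystem directly.
        run_with_memory_cap(["sleep", "1"])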
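And for the scheduling-algorithms point, here is a filter-then-score sketch in the shape described above for kube-scheduler. The node fields and scoring function are invented for illustration; they are not Kubernetes' actual predicates or priorities. Inverting the score would turn the same skeleton into a bin-packing scheduler in the style of Nomad.

    import heapq
    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        free_cpus: int
        free_mem_gb: int

    def schedule(pod_cpus, pod_mem_gb, nodes):
        # Filter: keep only nodes that can fit the pod at all.
        feasible = [n for n in nodes
                    if n.free_cpus >= pod_cpus and n.free_mem_gb >= pod_mem_gb]
        if not feasible:
            return None
        # Score: prefer the node with the most resources left over (spreading).
        # heapq is a min-heap, so scores are negated to pop the best node first.
        heap = [(-(n.free_cpus - pod_cpus) - (n.free_mem_gb - pod_mem_gb), n.name, n)
                for n in feasible]
        heapq.heapify(heap)
        return heapq.heappop(heap)[2]

    if __name__ == "__main__":
        nodes = [Node("a", 4, 16), Node("b", 8, 32), Node("c", 2, 4)]
        print(schedule(pod_cpus=2, pod_mem_gb=8, nodes=nodes))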

SLURM / Kubernetes Analogies

Kubernetes                                                  SLURM
----------                                                  -----
Namespace                                                   Partition
kubelet                                                     slurmd
kube-scheduler / kube-controller-manager / kube-apiserver   slurmctld
Container Runtime                                           Vanilla cgroups
etcd                                                        slurm.conf + NFS
Prometheus / ELK                                            slurmdbd (?)
Labels / Selectors                                          GRES / TRES / Features
Marking a node as unschedulable / cordoning                 Draining a node / Making a resource reservation

Further Reading