If every AI dev you support gets their own dedicated A100 or H100 GPU for their Jupyter or VS Code project, you're probably doing it wrong.

If you're in charge of your team's GPU cluster, check out NVIDIA's Multi-Instance GPU (MIG), a feature of the A100, H100, and newer GPUs that works regardless of whether you manage your cluster with k8s or Slurm. MIG makes it possible to support up to 7 AI developers on a single A100 (see the table below for a reference example).

If your end users don't have a great grasp of how much GPU capacity they really need, k8s starts to shine relative to Slurm, which hands out a fixed, inflexible allocation each time a "job" or "reservation" is granted. Even though prices have come down, GPU resources are still expensive, and there never seems to be enough capacity once the bigger training and inference projects start.

Use MIG today and don't let your devs get away with hogging more resources than they need. I'll include some helpful links in the comments.
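For anyone curious what this looks like from the driver's side, here's a minimal sketch (assuming the nvidia-ml-py / pynvml bindings and a MIG-capable GPU with MIG mode already enabled) that walks each physical GPU and prints the MIG slices carved out of it:

```python
# Minimal sketch: list MIG instances per GPU via pynvml (nvidia-ml-py).
# Assumes `pip install nvidia-ml-py` and an NVIDIA driver with MIG support.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(gpu)
        current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
        if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
            print(f"GPU {i} ({name}): MIG disabled, whole GPU goes to one user")
            continue
        # Walk the MIG slices; A100/H100 support at most 7 per GPU.
        for m in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, m)
            except pynvml.NVMLError:
                continue  # no MIG device at this index
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"GPU {i} ({name}) MIG slice {m}: "
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB used")
finally:
    pynvml.nvmlShutdown()
```

On k8s, the NVIDIA device plugin exposes these slices as schedulable resources (e.g. nvidia.com/mig-1g.5gb), so a pod can request just one slice instead of a whole GPU, which is what makes the flexible-allocation story above work.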