AI infrastructure · Intermediate

SLURM Training Jobs on HPC Clusters

An HPC training run is a resource request, a batch script, an environment, and a reproducibility contract.

SLURM · HPC · Training · AI research

Site connection

The SLM research project used Rutgers Amarel HPC and SLURM batch scripts to distribute training and evaluation jobs.

Visual model

Job eligibility under resource constraints: schedulers fit jobs into finite GPU, memory, and wall-time budgets.

Job        GPUs   Wall time   Status
tokenize   1      1h          eligible
sft-run    2      4h          eligible
eval       1      2h          eligible
quantize   1      1h          eligible
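
The eligibility idea can be sketched in a few lines of Python. This is only an illustration of "does the request fit the budget", not how SLURM schedules: the `Job` class, the `fits` and `eligible` functions, and the budget values are hypothetical, and a real scheduler also weighs memory, priority, and partition limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int
    hours: float


# Job sizes copied from the example jobs listed above.
JOBS = [
    Job("tokenize", 1, 1),
    Job("sft-run", 2, 4),
    Job("eval", 1, 2),
    Job("quantize", 1, 1),
]


def fits(job: Job, max_gpus: int, max_hours: float) -> bool:
    """A job is eligible only if every requested resource is within budget."""
    return job.gpus <= max_gpus and job.hours <= max_hours


def eligible(jobs, max_gpus, max_hours):
    """Return the names of the jobs that fit, in their original order."""
    return [job.name for job in jobs if fits(job, max_gpus, max_hours)]


if __name__ == "__main__":
    # Shrinking the budget to 1 GPU and 2 hours excludes sft-run.
    print(eligible(JOBS, max_gpus=1, max_hours=2))
```

With a budget of 2 GPUs and 4 hours all four jobs fit; dropping to 1 GPU and 2 hours leaves tokenize, eval, and quantize.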

A minimal sbatch script has three layers: scheduler directives, environment setup, and the command to run:

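#!/bin/bash
# Resource request: read by the scheduler when the job is submitted.
#SBATCH --job-name=slm-sft
#SBATCH --gres=gpu:2
#SBATCH --time=04:00:00
#SBATCH --mem=64G

# Environment: load the cluster's CUDA module.
module load cuda
# Workload: the training command itself.
python train.py --config configs/gsm8k-sft.yml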

The scheduler reads the directives, queues the job, and runs it when resources are available.
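
When many jobs differ only in their resource request, it can help to generate the script text instead of editing it by hand. The sketch below is one possible approach, not the project's actual tooling: `render_sbatch` and its parameters are hypothetical names, and it only builds a string. Submitting the result still means writing it to a file and running `sbatch` on it.

```python
def render_sbatch(job_name, gpus, time_limit, mem, command, modules=("cuda",)):
    """Return sbatch script text: directives, environment setup, then the command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={time_limit}",
        f"#SBATCH --mem={mem}",
        "",
    ]
    # Environment layer: one `module load` line per requested module.
    lines += [f"module load {module}" for module in modules]
    # Workload layer: the command the job actually runs.
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch(
        job_name="slm-sft",
        gpus=2,
        time_limit="04:00:00",
        mem="64G",
        command="python train.py --config configs/gsm8k-sft.yml",
    ))
```

Called with the values shown, the function reproduces the directives, module line, and training command of the script above.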

What the Scheduler Decides

SLURM does not just run code. It decides when a job can run based on requested resources, partitions, limits, priority, and cluster availability.

A request for too much time or too many GPUs can sit in the queue longer. A request for too little can fail mid-training, for example when the job reaches its wall-time limit and the scheduler terminates it.
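
One way to pick a wall-time request is to measure throughput on a short run and pad the estimate. The sketch below assumes you already know the total step count and the measured seconds per step. The `slurm_time` function and the 25% default margin are illustrative choices, not a SLURM recommendation. The output uses the `days-hours:minutes:seconds` form that `--time` accepts.

```python
import math


def slurm_time(total_steps: int, seconds_per_step: float, margin: float = 0.25) -> str:
    """Format an estimated run time, padded by `margin`, for use with --time."""
    seconds = math.ceil(total_steps * seconds_per_step * (1 + margin))
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    clock = f"{hours:02d}:{minutes:02d}:{secs:02d}"
    # Only include the day prefix when the estimate exceeds 24 hours.
    return f"{days}-{clock}" if days else clock


if __name__ == "__main__":
    # 10,000 steps at 1.0 s/step with a 25% margin is 12,500 s.
    print(slurm_time(10_000, 1.0))
```

For 10,000 steps at one second per step, the padded estimate is 12,500 seconds, which formats as `03:28:20`.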

Reproducibility for Training

A good training job records its code commit, config file, dataset version, environment, seed, hardware, and output checkpoint path.

Without those details, a result like 'GSM8k improved' is not reproducible. The cluster job becomes a one-off event rather than a scientific artifact.

Record            Why it matters
Git commit        Recreates code state
Config            Captures hyperparameters
Dataset version   Prevents silent data drift
Environment       Explains library and CUDA behavior
Job ID            Links logs and scheduler metadata
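
These records can be captured automatically at the start of a run. The following is a minimal sketch under stated assumptions: `build_manifest`, its field names, and the example values are hypothetical, and the dataset version and seed are passed in by the caller. `SLURM_JOB_ID` is the environment variable SLURM sets inside a running job.

```python
import json
import os
import platform
import subprocess
import sys


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_manifest(config_path, dataset_version, seed, checkpoint_dir, env=os.environ):
    """Collect the details needed to trace a result back to its run."""
    return {
        "git_commit": git_commit(),
        "config": config_path,
        "dataset_version": dataset_version,
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Present only inside a SLURM job; falls back to a marker elsewhere.
        "slurm_job_id": env.get("SLURM_JOB_ID", "not-in-slurm"),
        "checkpoint_dir": checkpoint_dir,
    }


if __name__ == "__main__":
    manifest = build_manifest(
        config_path="configs/gsm8k-sft.yml",
        dataset_version="v1",           # hypothetical version label
        seed=0,
        checkpoint_dir="checkpoints/",  # hypothetical path
    )
    print(json.dumps(manifest, indent=2))
```

Writing this JSON next to the checkpoints keeps the result and its provenance together. A fuller version would also record library versions and the GPU model.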

Common Pitfalls

  • Requesting resources without measuring actual usage.
  • Training from a dirty or unrecorded code state.
  • Writing checkpoints to temporary storage accidentally.
  • Ignoring failed or preempted jobs in results.
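
The last pitfall can be partly guarded against by checking job states before aggregating results. The sketch below assumes accounting text in the pipe-delimited form produced by `sacct --parsable2 --noheader --format=JobID,State`. The `split_by_state` function, the `UNCLEAN` set, and the sample job IDs are hypothetical, and the set covers only a subset of SLURM's job states.

```python
# States that mean a run did not finish cleanly (a subset of SLURM's job states).
UNCLEAN = {"FAILED", "PREEMPTED", "TIMEOUT", "CANCELLED", "NODE_FAIL", "OUT_OF_MEMORY"}


def split_by_state(sacct_text: str):
    """Partition `JobID|State` lines into (completed, unclean) job ID lists."""
    completed, unclean = [], []
    for line in sacct_text.strip().splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        job_id, state = line.split("|")[:2]
        if "." in job_id:
            continue  # skip job steps such as 101.batch
        # sacct can report e.g. "CANCELLED by 1001"; keep only the state word.
        words = state.split()
        state = words[0] if words else ""
        if state == "COMPLETED":
            completed.append(job_id)
        elif state in UNCLEAN:
            unclean.append(job_id)
        # Other states (RUNNING, PENDING, ...) are neither result nor failure yet.
    return completed, unclean


# Made-up accounting output for illustration only.
SAMPLE = """101|COMPLETED
101.batch|COMPLETED
102|PREEMPTED
103|CANCELLED by 1001
104|RUNNING
"""

if __name__ == "__main__":
    done, bad = split_by_state(SAMPLE)
    print("use in results:", done)
    print("rerun or report:", bad)
```

Only job 101 would count toward results here. Jobs 102 and 103 should be rerun or reported, not silently dropped.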

Quick check

What does sbatch do?
  1. Submits a batch script to Slurm
  2. Plots a loss curve
  3. Creates a vector embedding
  4. Builds a web page

Answer: 1. The Slurm sbatch command submits batch scripts to the scheduler.
