SLURM Training Jobs on HPC Clusters
An HPC training run is a resource request, a batch script, an environment, and a reproducibility contract.
Site connection
The SLM research project used Rutgers Amarel HPC and SLURM batch scripts to distribute training and evaluation jobs.
Visual model
[Interactive figure: job eligibility under resource constraints. Varying the GPU count and wall time shows which jobs fit a scheduler request.]
Schedulers fit jobs into finite GPU, memory, and wall-time budgets.
A minimal sbatch script has three layers: scheduler directives (the #SBATCH lines), environment setup, and the command to run:
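#!/bin/bash
# Layer 1: scheduler directives (must come before the first command)
#SBATCH --job-name=slm-sft
#SBATCH --gres=gpu:2
#SBATCH --time=04:00:00
#SBATCH --mem=64G

# Layer 2: environment setup
module load cuda

# Layer 3: the training command
python train.py --config configs/gsm8k-sft.yml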
The scheduler reads the directives, queues the job, and runs it when resources are available.
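The submit step can be sketched as follows. This is a minimal sketch, assuming the script above is saved as `train.sbatch` (a hypothetical filename); the helper names `job_id_from_parsable` and `submit_job` are illustrative, not part of Slurm. It relies on `sbatch --parsable`, which prints only the job ID, followed by `;` and the cluster name on multi-cluster setups.

```shell
#!/bin/bash
# Sketch: submit a batch script and keep the job ID for later bookkeeping.
# Assumption: the script is saved as train.sbatch (hypothetical filename).

# Extract the job ID from `sbatch --parsable` output, which is either
# "<jobid>" or "<jobid>;<cluster>".
job_id_from_parsable() {
    local out="$1"
    printf '%s\n' "${out%%;*}"
}

# Submit a script and print its job ID; returns non-zero if sbatch fails.
submit_job() {
    local script="$1" out
    out="$(sbatch --parsable "$script")" || return 1
    job_id_from_parsable "$out"
}

# Usage on a cluster login node:
#   jid="$(submit_job train.sbatch)"
#   squeue -j "$jid"      # shows whether the job is pending or running
```

Keeping the job ID at submit time makes it easy to look up logs and scheduler metadata for that run later.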
What the Scheduler Decides
SLURM does not just run code. It decides when a job can run based on requested resources, partitions, limits, priority, and cluster availability.
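A request for too much time or too many GPUs can wait in the queue longer, because fewer openings in the schedule can satisfy it. A request for too little can fail mid-training: a job that reaches its wall-time limit is terminated, and one that exceeds its memory request is typically killed as well.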
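One rough way to compare the size of two requests is GPU-hours (GPUs multiplied by wall time). The sketch below is illustrative arithmetic only and is not how Slurm computes priority; the function names are hypothetical, and the parser handles only the `HH:MM:SS` and `D-HH:MM:SS` time forms.

```shell
#!/bin/bash
# Sketch: estimate the GPU-hours implied by a --gres / --time request.
# Assumption: only HH:MM:SS and D-HH:MM:SS time strings are handled.

# Convert a time string such as 04:00:00 or 1-12:30:00 to seconds.
time_to_seconds() {
    local days clock
    case "$1" in
        *-*) days="${1%%-*}"; clock="${1#*-}" ;;
        *)   days=0;          clock="$1" ;;
    esac
    echo "$clock" | LC_ALL=C awk -F: -v d="$days" \
        '{ print d * 86400 + $1 * 3600 + $2 * 60 + $3 }'
}

# Print GPU-hours for a GPU count and a time string, to one decimal place.
gpu_hours() {
    local secs
    secs="$(time_to_seconds "$2")"
    LC_ALL=C awk -v g="$1" -v s="$secs" 'BEGIN { printf "%.1f\n", g * s / 3600 }'
}

# The request in the script above: 2 GPUs for 4 hours.
gpu_hours 2 04:00:00    # prints 8.0
```

Comparing this figure with the time a finished job actually used is one way to decide whether the next request can be smaller.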
Reproducibility for Training
A good training job records the code commit, config file, dataset version, environment, random seed, hardware, and output checkpoint path.
Without those details, a result like "GSM8K improved" is not reproducible. The cluster job becomes a one-off event rather than a scientific artifact.
| Record | Why it matters |
|---|---|
| Git commit | Recreates code state |
| Config | Captures hyperparameters |
| Dataset version | Prevents silent data drift |
| Environment | Explains library and CUDA behavior |
| Job ID | Links logs and scheduler metadata |
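The records in the table can be captured at the start of a job with a few lines of shell. This is a minimal sketch, not a standard tool: the `manifest.txt` filename, the field names, and the `DATASET_VERSION` variable are hypothetical conventions. `SLURM_JOB_ID` is set by Slurm inside a running job, and `CUDA_VISIBLE_DEVICES` is typically set for GPU jobs.

```shell
#!/bin/bash
# Sketch: write a small run manifest next to the training outputs.
# Assumptions: manifest.txt, the field names, and DATASET_VERSION are
# hypothetical conventions chosen for illustration.

write_manifest() {
    local out_dir="$1" config="$2"
    mkdir -p "$out_dir"
    {
        # Commit of the code being run; "unknown" outside a git checkout.
        echo "git_commit=$(git rev-parse --verify HEAD 2>/dev/null || echo unknown)"
        echo "config=$config"
        echo "dataset_version=${DATASET_VERSION:-unset}"
        # Set by Slurm inside a job; "none" when run outside the scheduler.
        echo "job_id=${SLURM_JOB_ID:-none}"
        echo "gpus=${CUDA_VISIBLE_DEVICES:-none}"
        echo "host=$(uname -n)"
    } > "$out_dir/manifest.txt"
}

# Usage inside the batch script, before the training command:
#   write_manifest "outputs/$SLURM_JOB_ID" configs/gsm8k-sft.yml
```

Writing the manifest before training starts means the record exists even if the job later fails or is preempted.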
Common Pitfalls
- Requesting resources without measuring actual usage.
- Training from a dirty or unrecorded code state.
- Writing checkpoints to temporary storage accidentally.
- Ignoring failed or preempted jobs in results.
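To avoid the last pitfall, each run's final scheduler state can be recorded alongside its results. The sketch below maps a Slurm job state to a coarse label; the labels and the function name `classify_state` are hypothetical, and the list covers only some common states.

```shell
#!/bin/bash
# Sketch: label a job by its final Slurm state so that failed or
# preempted runs are flagged rather than silently dropped from results.
# Assumption: the labels (ok, in_progress, incomplete, unknown) are
# illustrative, and only common states are listed.

classify_state() {
    case "$1" in
        COMPLETED)
            echo ok ;;
        PENDING|RUNNING)
            echo in_progress ;;
        # Trailing * covers forms such as "CANCELLED by <uid>".
        PREEMPTED*|TIMEOUT*|CANCELLED*|FAILED*|OUT_OF_MEMORY*|NODE_FAIL*)
            echo incomplete ;;
        *)
            echo unknown ;;
    esac
}

# Usage on a cluster, given a job ID in $jid:
#   state="$(sacct -X -n -P -j "$jid" --format=State)"
#   classify_state "$state"
#   sacct -j "$jid" --format=JobID,Elapsed,MaxRSS,State   # measured usage
```

The final `sacct` call in the usage notes also addresses the first pitfall, since `Elapsed` and `MaxRSS` show what the job actually consumed.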
Quick check
What does sbatch do?
- Submits a batch script to Slurm
- Plots a loss curve
- Creates a vector embedding
- Builds a web page
Answer: Submits a batch script to Slurm. The sbatch command hands the script to the scheduler, which queues it until the requested resources are available.