Slurm User Guide: Job Submission & Management
Abstract: This guide provides standard Slurm job submission templates for cluster users, covering batch scripts from single-node debugging jobs to multi-node distributed training (PyTorch/DeepSpeed), along with common job management commands.
1. Introduction
Slurm (Simple Linux Utility for Resource Management) is the standard scheduler for HPC clusters. Users submit scripts describing their resource needs, and Slurm queues and executes them when resources become available.
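A typical interaction looks like the sketch below (script and file names are placeholders; the job ID is assigned by Slurm at submission time):
sbatch <script>              # prints "Submitted batch job <jobid>" and returns immediately
squeue -u $USER              # the job waits as PD (pending) until resources free up, then runs as R
cat <jobid>.out              # stdout collects in the file named by --output once the job starts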
2. Basic Job Template (Single Node)
Ideal for debugging, preprocessing, or inference.
2.1 Script (submit_single.sh)
#!/bin/bash
#SBATCH --job-name=demo_job # Job name
#SBATCH --partition=debug # Partition name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # Tasks per node
#SBATCH --cpus-per-task=8 # CPU cores per task
#SBATCH --gres=gpu:1 # GPUs per node
#SBATCH --output=%j.out # Stdout (%j = JobID)
#SBATCH --error=%j.err # Stderr
echo "Job Start Time: $(date)"
echo "Running on node: $(hostname)"
# Load Environment
module load cuda/12.1
source activate my_env
# Run Command
nvidia-smi
python train.py --epochs 10
echo "Job End Time: $(date)"2.2 Submission
3. Distributed Training Template (Multi-Node)
For large-scale training (e.g., 4 Nodes x 8 GPUs = 32 GPUs).
3.1 Script (submit_dist.sh)
#!/bin/bash
#SBATCH --job-name=llm_train
#SBATCH --partition=gpu_p1
#SBATCH --nodes=4 # Request 4 nodes
#SBATCH --ntasks-per-node=1 # 1 Task per node (for torchrun)
#SBATCH --cpus-per-task=64 # All CPUs
#SBATCH --gres=gpu:8 # 8 GPUs per node
#SBATCH --exclusive # Exclusive access
#SBATCH --output=logs/%x-%j.out
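# Note: Slurm does not create missing directories; run "mkdir -p logs" before submitting,
# otherwise the output file cannot be written.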
# 1. Get Master Node Info (For PyTorch DDP)
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
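# If the head node reports several IP addresses, keep only the first
# (e.g. append | awk '{print $1}' to the command below).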
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=29500
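# torchrun sets WORLD_SIZE/RANK/LOCAL_RANK for each worker process on its own;
# the export below is mainly useful for logging and sanity checks.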
export WORLD_SIZE=$(( SLURM_NNODES * SLURM_GPUS_ON_NODE ))
echo "Master: $MASTER_ADDR:$MASTER_PORT"
# 2. Generate Runtime Script
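# Because the EOF delimiter below is unquoted, $SLURM_NNODES, $MASTER_ADDR, etc. are expanded
# here on the batch node, so run_task.sh contains their concrete values.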
cat <<EOF > run_task.sh
#!/bin/bash
source /home/user/anaconda3/bin/activate llm
torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_llm.py \
    --deepspeed ds_config.json
EOF
chmod +x run_task.sh
# 3. Launch in Parallel
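# srun starts one copy of run_task.sh per allocated node (ntasks-per-node=1);
# --label prefixes every output line with the originating task ID.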
srun --output=logs/job_%j_node_%N.log --label ./run_task.sh
4. Key Parameters
| Parameter | Flag | Description | Example |
|---|---|---|---|
| --job-name | -J | Job identifier | gpt_train |
| --partition | -p | Queue partition | gpu_h800 |
| --nodes | -N | Node count | 4 |
| --ntasks-per-node | - | Tasks (MPI ranks) per node | 1 |
| --gres | - | Generic resources | gpu:8 |
| --exclusive | - | Exclusive node access | - |
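These options can also be given on the sbatch command line, where they take precedence over the #SBATCH directives in the script; for example (job name and partition are illustrative):
sbatch -J gpt_train -p gpu_h800 -N 4 --ntasks-per-node=1 --gres=gpu:8 --exclusive submit_dist.sh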
5. Cheat Sheet
5.1 Job Management
- Submit: sbatch <script>
- Cancel: scancel <job_id>
- Interactive (Debug): srun -p debug -N 1 --gres=gpu:1 --pty /bin/bash
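A related shortcut when a batch of submissions goes wrong (use with care, it is indiscriminate):
scancel -u $USER             # cancel every job you currently have queued or running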
5.2 Status
- Queue: squeue -u $USER
- Nodes: sinfo
- Job Details: scontrol show job <job_id>
6. Troubleshooting
Q: Invalid partition name
A: Use sinfo to verify partition names.
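For example, list partitions together with their limits and GPU resources (the format string is just one convenient choice):
sinfo -o "%P %a %l %D %G"    # partition, availability, time limit, node count, GRES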
Q: Requested node configuration is not available
A: Your per-node request (CPUs, GPUs, or memory) exceeds what any node in the chosen partition offers; reduce the request or switch partitions.
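To see what a single node actually offers, compare your request against a node-oriented listing such as:
sinfo -N -o "%N %c %m %G"    # node name, CPUs, memory (MB), GRES
scontrol show node <node_name>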
Q: Hanging at initialization
A: Check MASTER_ADDR resolution, firewall rules on MASTER_PORT, and NCCL health.
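Useful switches to add to the batch script while debugging; the interface name is cluster-specific and only an assumed example:
export NCCL_DEBUG=INFO                  # verbose NCCL setup and error messages
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra PyTorch distributed diagnostics
# export NCCL_SOCKET_IFNAME=ib0         # pin NCCL to the high-speed interface if autodetection picks the wrong one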
