Slurm User Guide: Job Submission & Management
Abstract: This guide provides standard Slurm job submission templates for cluster users, covering batch scripts from single-node debugging jobs to multi-node distributed training (PyTorch/DeepSpeed), along with common job management commands.
1. Introduction
Slurm (Simple Linux Utility for Resource Management) is the standard scheduler for HPC clusters. Users submit scripts describing their resource needs, and Slurm queues and executes them when resources become available.
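A typical interaction looks like the sketch below (script and file names are placeholders; the job ID is assigned by Slurm at submission time):
sbatch <script>              # prints "Submitted batch job <jobid>" and returns immediately
squeue -u $USER              # the job waits as PD (pending) until resources free up, then runs as R
cat <jobid>.out              # stdout collects in the file named by --output once the job starts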
2. Basic Job Template (Single Node)
Ideal for debugging, preprocessing, or inference.
2.1 Script (submit_single.sh)
#!/bin/bash
#SBATCH --job-name=demo_job # Job name
#SBATCH --partition=debug # Partition name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # Tasks per node
#SBATCH --cpus-per-task=8 # CPU cores per task
#SBATCH --gres=gpu:1 # GPUs per node
#SBATCH --output=%j.out # Stdout (%j = JobID)
#SBATCH --error=%j.err # Stderr
echo "Job Start Time: $(date)"
echo "Running on node: $(hostname)"
# Load Environment
module load cuda/12.1
source activate my_env
# Run Command
nvidia-smi
python train.py --epochs 10
echo "Job End Time: $(date)"2.2 Submission
3. Distributed Training Template (Multi-Node)
For large-scale training (e.g., 4 Nodes x 8 GPUs = 32 GPUs).
3.1 Script (submit_dist.sh)
#!/bin/bash
#SBATCH --job-name=llm_train
#SBATCH --partition=gpu_p1
#SBATCH --nodes=4 # Request 4 nodes
#SBATCH --ntasks-per-node=1 # 1 Task per node (for torchrun)
#SBATCH --cpus-per-task=64 # All CPUs
#SBATCH --gres=gpu:8 # 8 GPUs per node
#SBATCH --exclusive # Exclusive access
#SBATCH --output=logs/%x-%j.out
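# Note: Slurm does not create missing directories; run "mkdir -p logs" before submitting,
# otherwise the output file cannot be written.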
# 1. Get Master Node Info (For PyTorch DDP)
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node=${nodes[0]}
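# If the head node reports several IP addresses, keep only the first
# (e.g. append | awk '{print $1}' to the command below).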
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=29500
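# torchrun sets WORLD_SIZE/RANK/LOCAL_RANK for each worker process on its own;
# the export below is mainly useful for logging and sanity checks.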
export WORLD_SIZE=$(( SLURM_NNODES * SLURM_GPUS_ON_NODE ))
echo "Master: $MASTER_ADDR:$MASTER_PORT"
# 2. Generate Runtime Script
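# Because the EOF delimiter below is unquoted, $SLURM_NNODES, $MASTER_ADDR, etc. are expanded
# here on the batch node, so run_task.sh contains their concrete values.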
cat <<EOF > run_task.sh
#!/bin/bash
source /home/user/anaconda3/bin/activate llm
torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_llm.py \
    --deepspeed ds_config.json
EOF
chmod +x run_task.sh
# 3. Launch in Parallel
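# srun starts one copy of run_task.sh per allocated node (ntasks-per-node=1);
# --label prefixes every output line with the originating task ID.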
srun --output=logs/job_%j_node_%N.log --label ./run_task.sh
4. Key Parameters
| Parameter | Flag | Description | Example |
|---|---|---|---|
| --job-name | -J | Job identifier | gpt_train |
| --partition | -p | Queue partition | gpu_h800 |
| --nodes | -N | Node count | 4 |
| --ntasks-per-node | - | Tasks (MPI ranks) per node | 1 |
| --gres | - | Generic resources | gpu:8 |
| --exclusive | - | Exclusive node access | - |
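These options can also be given on the sbatch command line, where they take precedence over the #SBATCH directives in the script; for example (job name and partition are illustrative):
sbatch -J gpt_train -p gpu_h800 -N 4 --ntasks-per-node=1 --gres=gpu:8 --exclusive submit_dist.sh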
5. Cheat Sheet
5.1 Job Management
- Submit: sbatch <script>
- Cancel: scancel <job_id>
- Interactive (Debug): srun -p debug -N 1 --gres=gpu:1 --pty /bin/bash
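A related shortcut when a batch of submissions goes wrong (use with care, it is indiscriminate):
scancel -u $USER             # cancel every job you currently have queued or running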
5.2 Status
- Queue: squeue -u $USER
- Nodes: sinfo
- Job Details: scontrol show job <job_id>
6. Troubleshooting
Q: Invalid partition name
A: Use sinfo to verify partition names.
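For example, list partitions together with their limits and GPU resources (the format string is just one convenient choice):
sinfo -o "%P %a %l %D %G"    # partition, availability, time limit, node count, GRES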
Q: Requested node configuration is not available
A: Your per-node request (CPUs, GPUs, or memory) exceeds what any node in the chosen partition offers; reduce the request or switch partitions.
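To see what a single node actually offers, compare your request against a node-oriented listing such as:
sinfo -N -o "%N %c %m %G"    # node name, CPUs, memory (MB), GRES
scontrol show node <node_name>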
Q: Hanging at initialization
A: Check MASTER_ADDR resolution, firewall rules on MASTER_PORT, and NCCL health.
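Useful switches to add to the batch script while debugging; the interface name is cluster-specific and only an assumed example:
export NCCL_DEBUG=INFO                  # verbose NCCL setup and error messages
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra PyTorch distributed diagnostics
# export NCCL_SOCKET_IFNAME=ib0         # pin NCCL to the high-speed interface if autodetection picks the wrong one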
