Slurm in Practice: A Docker-Based Template for Submitting Distributed Training Jobs
This article describes how to submit a distributed training job that runs inside a Docker container through the Slurm scheduler. The approach combines Slurm's resource scheduling with Docker's environment isolation, which makes it well suited to large-scale deep-learning training on clusters, for example pretraining with Megatron-LM.
1. Solution Overview
On HPC clusters, users typically have to run jobs without root privileges, yet they still need specific CUDA/Python environments. The core logic of this solution is as follows:
- Resource Allocation: Request compute nodes and GPU resources via `sbatch`.
- Container Launch: Use `srun` to launch a Docker container on each allocated node.
- Script Generation: Dynamically generate the training script (`run_training.sh`) inside the shell script and pass it into the container via a volume mount.
- Distributed Communication: Use the PyTorch Elastic (c10d) backend, combined with Slurm environment variables, to discover the master node and establish communication.
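Before walking through the full template in Section 2, here is a minimal sketch of how these pieces fit together. Node counts, the image name, and the mount paths are illustrative placeholders, and a shared filesystem is assumed so every node sees the generated `run_training.sh`:

```bash
#!/bin/bash
# Resource allocation (values are illustrative)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

# Script generation: create the file that will run INSIDE each container.
# The quoted delimiter defers ${SLURM_*} expansion to container runtime.
cat <<'EOF' > run_training.sh
#!/bin/bash
echo "rank ${SLURM_PROCID}/${SLURM_NTASKS} on $(hostname)"  # torchrun launch goes here
EOF
chmod +x run_training.sh

# Container launch: srun starts one container per allocated node;
# -e forwards the Slurm variables that the generated script reads.
srun bash -c "docker run --rm --gpus all --network host \
  -e SLURM_PROCID -e SLURM_NTASKS \
  -v \${PWD}:/workspace -w /workspace megatron_pytorch:latest bash run_training.sh"
```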
2. Complete Job Script (Sbatch Template)
Below is a ready-to-use `megatron_docker.slurm` script template. It demonstrates a training task for a Megatron-DeepSeek-R1-Distill-Qwen-1.5B model.
#!/bin/bash
#SBATCH --job-name=megatron # Job name
#SBATCH --output=output_%j.log # Standard output log (%j is Job ID)
#SBATCH --error=error_%j.log # Error log
#SBATCH --partition=TEST1 # Partition name
#SBATCH --account=test1267 # Account name (Billing/Access)
#SBATCH --exclude=g[01-86,88-89,91] # (Optional) Exclude specific nodes
#SBATCH --gres=gpu:8 # Request 8 GPUs per node
#SBATCH --ntasks=2 # Total tasks (Equals total nodes here)
#SBATCH --nodes=2 # Request 2 nodes
#SBATCH --ntasks-per-node=1 # 1 task per node (1 Docker container)
#SBATCH --cpus-per-task=64 # 64 CPU cores per task
#SBATCH --mem=1000G # 1000G memory per node
# ==========================================
# 1. Environment & Path Configuration
# ==========================================
# Docker Image & Path Mapping
docker_image="megatron_pytorch:latest"
docker_workspace="/workspace/files" # Path inside container
host_workspace="/home/test/test06/qzk/" # Path on host (to be mounted)
# Training Data & Model Paths (Relative to container path)
load_path="${docker_workspace}/PLM/Megatron-DeepSeek-R1-Distill-Qwen-1.5B/"
save_path="data/Distill-Qwen-1.5B-OpenR1-Math-94k/checkpoints/"
tensorboard_path="data/Distill-Qwen-1.5B-OpenR1-Math-94k/tensorboard/"
# ==========================================
# 2. Dynamic Training Script Generation (run_training.sh)
# ==========================================
# Note: This script will run INSIDE the Docker container
cat <<RUN_EOF > run_training.sh
#!/bin/bash
set -ex
# 2.1 Install specific dependencies (if needed)
# pip install modelbest_sdk-0.2.5.7-py3-none-any.whl
# 2.2 Get Distributed Environment Parameters
# Automatically detect GPUs on current node
GPUS_PER_NODE=\$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l)
# Read World Info from Slurm environment variables
# (SLURM_NTASKS / SLURM_PROCID are forwarded into the container by "docker run -e" in the sbatch script)
WORLD_SIZE=\${SLURM_NTASKS}
RANK=\${SLURM_PROCID}
# Master node hostname (first host in the allocation). Resolved on the host when this
# script is generated, because scontrol is not available inside the container.
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
MASTER_PORT=6420
echo "Node Info: Rank \${RANK}/\${WORLD_SIZE} | Master: \${MASTER_ADDR}:\${MASTER_PORT} | GPUs: \${GPUS_PER_NODE}"
# 2.3 Model & Data Configuration
TOKENIZER_MODEL=${docker_workspace}/PLM/DeepSeek-R1-Distill-Qwen-1.5B
# Data Path Format: "Weight Path"
DATA_PATH="1.0 ${docker_workspace}/Datasets/OpenR1-Math-prefix/OpenR1-Math-220k_deepseek_qwen-prefix"
# 2.4 Build PyTorch Distributed Args (torchrun)
DISTRIBUTED_ARGS=(
--nproc_per_node \$GPUS_PER_NODE
--nnodes \$WORLD_SIZE
--node_rank \$RANK
--master_addr \$MASTER_ADDR
--master_port \$MASTER_PORT
--rdzv_backend c10d
--rdzv_endpoint \$MASTER_ADDR:\$MASTER_PORT
)
# 2.5 Megatron Model Args (Example)
GPT_MODEL_ARGS=(
--vocab-size 151936
--make-vocab-size-divisible-by 1
--num-layers 28
--hidden-size 1536
--num-attention-heads 16
--seq-length 4096
--max-position-embeddings 4096
--untie-embeddings-and-output-weights
)
# 2.6 Training Hyperparameters
TRAINING_ARGS=(
--micro-batch-size 4
--global-batch-size 512
--lr 1e-5
--lr-decay-style cosine
--train-iters 500000
--weight-decay 0.1
--clip-grad 1.0
--log-interval 10
--save-interval 1000
--eval-interval 1000
--eval-iters 10
)
# 2.7 Output Configuration
OUTPUT_ARGS=(
--save ${docker_workspace}/${save_path}
--load ${load_path}
--tensorboard-dir ${docker_workspace}/${tensorboard_path}
)
# 2.8 Launch Command
torchrun \
"\${DISTRIBUTED_ARGS[@]}" \
pretrain_gpt.py \
"\${GPT_MODEL_ARGS[@]}" \
"\${TRAINING_ARGS[@]}" \
"\${OUTPUT_ARGS[@]}" \
--tokenizer-model \${TOKENIZER_MODEL} \
--data-path \${DATA_PATH}
RUN_EOF
# Make script executable
chmod +x run_training.sh
# ==========================================
# 3. Submit Task to Docker
# ==========================================
echo "Starting Docker container on all nodes..."
# srun executes the following command on EVERY allocated node.
# The -e flags forward the Slurm variables that run_training.sh reads at runtime.
srun bash -c "
docker run --rm \\
--gpus all \\
--network host \\
--ipc=host \\
--ulimit memlock=-1 \\
--ulimit stack=67108864 \\
--privileged=true \\
-e SLURM_NTASKS \\
-e SLURM_PROCID \\
-v ${host_workspace}:${docker_workspace} \\
-w ${docker_workspace} \\
${docker_image} \\
bash run_training.sh
"3. Core Mechanism Analysis
3.1 Path Mapping Strategy
To maintain data consistency between the container and host, volume mounting is used. It is recommended to use unified variables:
- `host_workspace`: Storage path on the physical machine (e.g., a Lustre/NFS mount point).
- `docker_workspace`: Standard working directory inside the container.
- Tip: Set both `load_path` and `save_path` as subdirectories of `docker_workspace`, so that a single mount grants access to all resources.
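As a quick illustration (reusing the variable values from the template above; the `ls` target is arbitrary), a single bind mount makes every host subdirectory visible under the container workspace:

```bash
host_workspace="/home/test/test06/qzk/"   # host side (e.g., Lustre/NFS)
docker_workspace="/workspace/files"       # container side

# ${docker_workspace}/PLM/... inside the container resolves to
# ${host_workspace}/PLM/... on the host, so load/save paths need no extra mounts.
docker run --rm \
  -v "${host_workspace}:${docker_workspace}" \
  -w "${docker_workspace}" \
  megatron_pytorch:latest ls PLM
```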
3.2 Distributed Environment Injection
PyTorch Elastic (torchrun) needs to know the cluster topology. The script retrieves this information from Slurm and passes it into the container: `RANK` and `WORLD_SIZE` are forwarded via `docker run -e`, while the master address is resolved on the host when `run_training.sh` is generated:
| Parameter | Source | Description |
|---|---|---|
| `WORLD_SIZE` | `$SLURM_NTASKS` | Total number of processes (equal to the node count here) |
| `RANK` | `$SLURM_PROCID` | Global index of the current node (0, 1, ...) |
| `MASTER_ADDR` | `scontrol show hostname ...` | First node's hostname, resolved on the host (scontrol is unavailable inside the container) |
| `GPUS_PER_NODE` | `nvidia-smi` | Detected dynamically to avoid hardcoding |
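A quick way to sanity-check these values is to echo them from every task before involving Docker at all (run inside the same sbatch allocation; the single quotes keep the expansion per task):

```bash
# Prints one line per allocated node: hostname, rank, and world size
srun bash -c 'echo "$(hostname): RANK=${SLURM_PROCID} WORLD_SIZE=${SLURM_NTASKS}"'

# Expands the compact node list (e.g., g[01-02]) into one hostname per line
scontrol show hostname "${SLURM_NODELIST}"
```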
3.3 Key Docker Parameters
To approach bare-metal performance, the Docker launch flags are critical:
- `--gpus all`: Pass through all GPU devices.
- `--network host`: Use the host network stack to avoid NAT overhead (critical for IB/RoCE).
- `--ipc=host`: Share the host's shared memory to prevent DataLoader deadlocks.
- `--ulimit memlock=-1`: Remove the memory-lock limit so RDMA can pin memory.
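To confirm these flags actually take effect, you can run a one-off container with the same options and inspect the limits from inside (a throwaway check, assuming the image from the template is already present on the node):

```bash
docker run --rm --gpus all --network host --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  megatron_pytorch:latest bash -c '
    nvidia-smi -L          # all GPUs should be listed
    ulimit -l              # "unlimited" => memlock limit removed
    df -h /dev/shm         # host-sized shm => --ipc=host is effective
  '
```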
4. Best Practices & Notes
Debugging Advice
For the first submission, set `#SBATCH --time=00:10:00` and use `srun --pty ...` for an interactive test to ensure the Docker image can be pulled and the paths are mounted correctly.
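For example, a short interactive session on one node of the same partition (partition and account values are copied from the template above; adjust them to your cluster) lets you verify the image and the mount before the real run:

```bash
# Grab one node for 10 minutes and open an interactive shell on it
srun --partition=TEST1 --account=test1267 --nodes=1 --gres=gpu:8 \
     --time=00:10:00 --pty bash

# Then, on the allocated node, check that the image and mount work:
docker run --rm --gpus all \
  -v /home/test/test06/qzk/:/workspace/files -w /workspace/files \
  megatron_pytorch:latest nvidia-smi -L
```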
- Image Caching: On large clusters without internet access, ensure `docker_image` has been distributed to every compute node beforehand via `docker pull` or `docker load`, or use a local Harbor registry (see the sketch below).
- Permissions: If the cluster prohibits standard `docker` commands for users, ask the admins to configure user namespaces, or switch to Singularity/Apptainer (the standard for HPC).
- Log Management: The generated script uses `set -ex`, so every executed command is traced as it runs; note that the trace goes to stderr, i.e., `error_%j.log`. If training hangs, check `error_%j.log` for NCCL errors.
- Local Packages: If you install local `.whl` packages (e.g., `modelbest_sdk`), ensure the file exists under `host_workspace`; otherwise, `pip install` inside the container will fail.
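For the image-caching point above, one workable pattern (a sketch only; the tarball path on shared storage is hypothetical) is to save the image once and load it on every allocated node at the start of the job:

```bash
# Once, on a machine that already has the image:
docker save megatron_pytorch:latest -o /home/test/test06/qzk/megatron_pytorch.tar

# At the top of the sbatch script, before the training srun:
srun --ntasks-per-node=1 bash -c '
  docker image inspect megatron_pytorch:latest > /dev/null 2>&1 \
    || docker load -i /home/test/test06/qzk/megatron_pytorch.tar
'
```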
