Slurm in Practice: A Docker-Based Template for Submitting Distributed Training Jobs
This article describes how to submit a distributed training job that runs inside a Docker container through the Slurm scheduler. The approach combines Slurm's resource scheduling with Docker's environment isolation, which makes it well suited to large-scale deep-learning training on clusters, for example pretraining with Megatron-LM.
1. Solution Overview
On HPC clusters, users typically have to run jobs without root privileges, yet they still need specific CUDA/Python environments. The core logic of this solution is as follows:
- Resource Allocation: Request compute nodes and GPU resources via `sbatch`.
- Container Launch: Use `srun` to launch a Docker container on each allocated node.
- Script Generation: Dynamically generate the training script (`run_training.sh`) inside the shell script and pass it into the container via a volume mount.
- Distributed Communication: Use the PyTorch Elastic (c10d) backend, combined with Slurm environment variables, to discover the master node and establish communication.
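Before walking through the full template in Section 2, here is a minimal sketch of how these pieces fit together. Node counts, the image name, and the mount paths are illustrative placeholders, and a shared filesystem is assumed so every node sees the generated `run_training.sh`:

```bash
#!/bin/bash
# Resource allocation (values are illustrative)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

# Script generation: create the file that will run INSIDE each container.
# The quoted delimiter defers ${SLURM_*} expansion to container runtime.
cat <<'EOF' > run_training.sh
#!/bin/bash
echo "rank ${SLURM_PROCID}/${SLURM_NTASKS} on $(hostname)"  # torchrun launch goes here
EOF
chmod +x run_training.sh

# Container launch: srun starts one container per allocated node;
# -e forwards the Slurm variables that the generated script reads.
srun bash -c "docker run --rm --gpus all --network host \
  -e SLURM_PROCID -e SLURM_NTASKS \
  -v \${PWD}:/workspace -w /workspace megatron_pytorch:latest bash run_training.sh"
```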
2. Complete Job Script (Sbatch Template)
Below is a ready-to-use `megatron_docker.slurm` script template. It demonstrates a training task for a Megatron-DeepSeek-R1-Distill-Qwen-1.5B model.
#!/bin/bash
#SBATCH --job-name=megatron # Job name
#SBATCH --output=output_%j.log # Standard output log (%j is Job ID)
#SBATCH --error=error_%j.log # Error log
#SBATCH --partition=TEST1 # Partition name
#SBATCH --account=test1267 # Account name (Billing/Access)
#SBATCH --exclude=g[01-86,88-89,91] # (Optional) Exclude specific nodes
#SBATCH --gres=gpu:8 # Request 8 GPUs per node
#SBATCH --ntasks=2 # Total tasks (Equals total nodes here)
#SBATCH --nodes=2 # Request 2 nodes
#SBATCH --ntasks-per-node=1 # 1 task per node (1 Docker container)
#SBATCH --cpus-per-task=64 # 64 CPU cores per task
#SBATCH --mem=1000G # 1000G memory per node
# ==========================================
# 1. Environment & Path Configuration
# ==========================================
# Docker Image & Path Mapping
docker_image="megatron_pytorch:latest"
docker_workspace="/workspace/files" # Path inside container
host_workspace="/home/test/test06/qzk/" # Path on host (to be mounted)
# Training Data & Model Paths (Relative to container path)
load_path="${docker_workspace}/PLM/Megatron-DeepSeek-R1-Distill-Qwen-1.5B/"
save_path="data/Distill-Qwen-1.5B-OpenR1-Math-94k/checkpoints/"
tensorboard_path="data/Distill-Qwen-1.5B-OpenR1-Math-94k/tensorboard/"
# ==========================================
# 2. Dynamic Training Script Generation (run_training.sh)
# ==========================================
# Note: This script will run INSIDE the Docker container
cat <<RUN_EOF > run_training.sh
#!/bin/bash
set -ex
# 2.1 Install specific dependencies (if needed)
# pip install modelbest_sdk-0.2.5.7-py3-none-any.whl
# 2.2 Get Distributed Environment Parameters
# Automatically detect GPUs on current node
GPUS_PER_NODE=\$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l)
# Read World Info from Slurm environment variables
# (SLURM_NTASKS / SLURM_PROCID are forwarded into the container by "docker run -e" in the sbatch script)
WORLD_SIZE=\${SLURM_NTASKS}
RANK=\${SLURM_PROCID}
# Master node hostname (first host in the allocation). Resolved on the host when this
# script is generated, because scontrol is not available inside the container.
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
MASTER_PORT=6420
echo "Node Info: Rank \${RANK}/\${WORLD_SIZE} | Master: \${MASTER_ADDR}:\${MASTER_PORT} | GPUs: \${GPUS_PER_NODE}"
# 2.3 Model & Data Configuration
TOKENIZER_MODEL=${docker_workspace}/PLM/DeepSeek-R1-Distill-Qwen-1.5B
# Data Path Format: "Weight Path"
DATA_PATH="1.0 ${docker_workspace}/Datasets/OpenR1-Math-prefix/OpenR1-Math-220k_deepseek_qwen-prefix"
# 2.4 Build PyTorch Distributed Args (torchrun)
DISTRIBUTED_ARGS=(
--nproc_per_node \$GPUS_PER_NODE
--nnodes \$WORLD_SIZE
--node_rank \$RANK
--master_addr \$MASTER_ADDR
--master_port \$MASTER_PORT
--rdzv_backend c10d
--rdzv_endpoint \$MASTER_ADDR:\$MASTER_PORT
)
# 2.5 Megatron Model Args (Example)
GPT_MODEL_ARGS=(
--vocab-size 151936
--make-vocab-size-divisible-by 1
--num-layers 28
--hidden-size 1536
--num-attention-heads 16
--seq-length 4096
--max-position-embeddings 4096
--untie-embeddings-and-output-weights
)
# 2.6 Training Hyperparameters
TRAINING_ARGS=(
--micro-batch-size 4
--global-batch-size 512
--lr 1e-5
--lr-decay-style cosine
--train-iters 500000
--weight-decay 0.1
--clip-grad 1.0
--log-interval 10
--save-interval 1000
--eval-interval 1000
--eval-iters 10
)
# 2.7 Output Configuration
OUTPUT_ARGS=(
--save ${docker_workspace}/${save_path}
--load ${load_path}
--tensorboard-dir ${docker_workspace}/${tensorboard_path}
)
# 2.8 Launch Command
torchrun \
"\${DISTRIBUTED_ARGS[@]}" \
pretrain_gpt.py \
"\${GPT_MODEL_ARGS[@]}" \
"\${TRAINING_ARGS[@]}" \
"\${OUTPUT_ARGS[@]}" \
--tokenizer-model \${TOKENIZER_MODEL} \
--data-path \${DATA_PATH}
RUN_EOF
# Make script executable
chmod +x run_training.sh
# ==========================================
# 3. Submit Task to Docker
# ==========================================
echo "Starting Docker container on all nodes..."
# srun executes the following command on EVERY allocated node.
# The -e flags forward the Slurm variables that run_training.sh reads at runtime.
srun bash -c "
docker run --rm \\
--gpus all \\
--network host \\
--ipc=host \\
--ulimit memlock=-1 \\
--ulimit stack=67108864 \\
--privileged=true \\
-e SLURM_NTASKS \\
-e SLURM_PROCID \\
-v ${host_workspace}:${docker_workspace} \\
-w ${docker_workspace} \\
${docker_image} \\
bash run_training.sh
"3. Core Mechanism Analysis
3.1 Path Mapping Strategy
To maintain data consistency between the container and host, volume mounting is used. It is recommended to use unified variables:
- `host_workspace`: Storage path on the physical machine (e.g., a Lustre/NFS mount point).
- `docker_workspace`: Standard working directory inside the container.
- Tip: Set both `load_path` and `save_path` as subdirectories of `docker_workspace`, so that a single mount grants access to all resources.
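As a quick illustration (reusing the variable values from the template above; the `ls` target is arbitrary), a single bind mount makes every host subdirectory visible under the container workspace:

```bash
host_workspace="/home/test/test06/qzk/"   # host side (e.g., Lustre/NFS)
docker_workspace="/workspace/files"       # container side

# ${docker_workspace}/PLM/... inside the container resolves to
# ${host_workspace}/PLM/... on the host, so load/save paths need no extra mounts.
docker run --rm \
  -v "${host_workspace}:${docker_workspace}" \
  -w "${docker_workspace}" \
  megatron_pytorch:latest ls PLM
```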
3.2 Distributed Environment Injection
PyTorch Elastic (torchrun) needs to know the cluster topology. The script retrieves this information from Slurm and passes it into the container: `RANK` and `WORLD_SIZE` are forwarded via `docker run -e`, while the master address is resolved on the host when `run_training.sh` is generated:
| Parameter | Source | Description |
|---|---|---|
| `WORLD_SIZE` | `$SLURM_NTASKS` | Total number of processes (equal to the node count here) |
| `RANK` | `$SLURM_PROCID` | Global index of the current node (0, 1, ...) |
| `MASTER_ADDR` | `scontrol show hostname ...` | First node's hostname, resolved on the host (scontrol is unavailable inside the container) |
| `GPUS_PER_NODE` | `nvidia-smi` | Detected dynamically to avoid hardcoding |
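A quick way to sanity-check these values is to echo them from every task before involving Docker at all (run inside the same sbatch allocation; the single quotes keep the expansion per task):

```bash
# Prints one line per allocated node: hostname, rank, and world size
srun bash -c 'echo "$(hostname): RANK=${SLURM_PROCID} WORLD_SIZE=${SLURM_NTASKS}"'

# Expands the compact node list (e.g., g[01-02]) into one hostname per line
scontrol show hostname "${SLURM_NODELIST}"
```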
3.3 Key Docker Parameters
To approach bare-metal performance, the Docker launch flags are critical:
- `--gpus all`: Pass through all GPU devices.
- `--network host`: Use the host network stack to avoid NAT overhead (critical for IB/RoCE).
- `--ipc=host`: Share the host's shared memory to prevent DataLoader deadlocks.
- `--ulimit memlock=-1`: Remove the memory-lock limit so RDMA can pin memory.
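To confirm these flags actually take effect, you can run a one-off container with the same options and inspect the limits from inside (a throwaway check, assuming the image from the template is already present on the node):

```bash
docker run --rm --gpus all --network host --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  megatron_pytorch:latest bash -c '
    nvidia-smi -L          # all GPUs should be listed
    ulimit -l              # "unlimited" => memlock limit removed
    df -h /dev/shm         # host-sized shm => --ipc=host is effective
  '
```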
4. Best Practices & Notes
Debugging Advice
For the first submission, set `#SBATCH --time=00:10:00` and use `srun --pty ...` for an interactive test to ensure the Docker image can be pulled and the paths are mounted correctly.
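For example, a short interactive session on one node of the same partition (partition and account values are copied from the template above; adjust them to your cluster) lets you verify the image and the mount before the real run:

```bash
# Grab one node for 10 minutes and open an interactive shell on it
srun --partition=TEST1 --account=test1267 --nodes=1 --gres=gpu:8 \
     --time=00:10:00 --pty bash

# Then, on the allocated node, check that the image and mount work:
docker run --rm --gpus all \
  -v /home/test/test06/qzk/:/workspace/files -w /workspace/files \
  megatron_pytorch:latest nvidia-smi -L
```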
- Image Caching: On large clusters without internet access, ensure `docker_image` has been distributed to every compute node beforehand via `docker pull` or `docker load`, or use a local Harbor registry (see the sketch below).
- Permissions: If the cluster prohibits standard `docker` commands for users, ask the admins to configure user namespaces, or switch to Singularity/Apptainer (the standard for HPC).
- Log Management: The generated script uses `set -ex`, so every executed command is traced as it runs; note that the trace goes to stderr, i.e., `error_%j.log`. If training hangs, check `error_%j.log` for NCCL errors.
- Local Packages: If you install local `.whl` packages (e.g., `modelbest_sdk`), ensure the file exists under `host_workspace`; otherwise, `pip install` inside the container will fail.
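For the image-caching point above, one workable pattern (a sketch only; the tarball path on shared storage is hypothetical) is to save the image once and load it on every allocated node at the start of the job:

```bash
# Once, on a machine that already has the image:
docker save megatron_pytorch:latest -o /home/test/test06/qzk/megatron_pytorch.tar

# At the top of the sbatch script, before the training srun:
srun --ntasks-per-node=1 bash -c '
  docker image inspect megatron_pytorch:latest > /dev/null 2>&1 \
    || docker load -i /home/test/test06/qzk/megatron_pytorch.tar
'
```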
