
Slurm Practice: Distributed Training Submission Template Based on Docker

This article describes how to submit a distributed training job that runs inside a Docker container environment through the Slurm scheduling system. The approach combines Slurm's resource scheduling with Docker's environment isolation, which makes it well suited to large-scale deep learning training on clusters (e.g., Megatron-LM).

1. Solution Overview

On HPC clusters, users typically have to run jobs without root privileges, yet still need specific CUDA/Python environments. The core logic of this solution is as follows:

  1. Resource Allocation: Request compute nodes and GPU resources via sbatch.
  2. Container Launch: Use srun to launch a Docker container on each allocated node.
  3. Script Generation: Dynamically generate the training script (run_training.sh) within the sbatch script and expose it to the container through a volume mount.
  4. Distributed Communication: Utilize the PyTorch Elastic (c10d) backend, combined with Slurm environment variables, to automatically discover the Master node and establish communication.
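
The steps above can be condensed into the following skeleton (a minimal sketch: my_image:latest, the /workspace mount, and the train.py entry point are placeholders; the full template in Section 2 fills in the real values):

bash
#!/bin/bash
# Step 1: resource allocation
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

# Step 3: generate the in-container script (quoted delimiter: nothing expands here)
cat <<'EOF' > run_training.sh
torchrun --nnodes "$SLURM_NTASKS" --node_rank "$SLURM_PROCID" \
         --nproc_per_node 8 --rdzv_backend c10d \
         --rdzv_endpoint "$MASTER_ADDR:6420" train.py
EOF

# Steps 2 & 4: one container per node; -e forwards the rendezvous variables,
# and torchrun's c10d backend connects the nodes.
MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)
srun docker run --rm --gpus all --network host --ipc=host \
     -e SLURM_NTASKS -e SLURM_PROCID -e MASTER_ADDR="$MASTER_ADDR" \
     -v "$PWD":/workspace -w /workspace \
     my_image:latest bash run_training.sh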

2. Complete Job Script (Sbatch Template)

Below is a ready-to-use megatron_docker.slurm script template. This script demonstrates a Megatron-DeepSeek-R1-Distill-Qwen-1.5B model training task.

bash
#!/bin/bash
#SBATCH --job-name=megatron              # Job name
#SBATCH --output=output_%j.log           # Standard output log (%j is Job ID)
#SBATCH --error=error_%j.log             # Error log
#SBATCH --partition=TEST1                # Partition name
#SBATCH --account=test1267               # Account name (Billing/Access)
#SBATCH --exclude=g[01-86,88-89,91]      # (Optional) Exclude specific nodes
#SBATCH --gres=gpu:8                     # Request 8 GPUs per node
#SBATCH --ntasks=2                       # Total tasks (Equals total nodes here)
#SBATCH --nodes=2                        # Request 2 nodes
#SBATCH --ntasks-per-node=1              # 1 task per node (1 Docker container)
#SBATCH --cpus-per-task=64               # 64 CPU cores per task
#SBATCH --mem=1000G                      # 1000G memory per node

# ==========================================
# 1. Environment & Path Configuration
# ==========================================
# Docker Image & Path Mapping
docker_image="megatron_pytorch:latest"
docker_workspace="/workspace/files"      # Path inside container
host_workspace="/home/test/test06/qzk/"  # Path on host (to be mounted)

# Model checkpoint & output paths (save/tensorboard are given relative to the container workspace)
load_path="${docker_workspace}/PLM/Megatron-DeepSeek-R1-Distill-Qwen-1.5B/"
save_path="data/Distill-Qwen-1.5B-OpenR1-Math-94k/checkpoints/"
tensorboard_path="data/Distill-Qwen-1.5B-OpenR1-Math-94k/tensorboard/"

# ==========================================
# 2. Dynamic Training Script Generation (run_training.sh)
# ==========================================
# Note: This script will run INSIDE the Docker container
cat <<RUN_EOF > run_training.sh
#!/bin/bash
set -ex

# 2.1 Install specific dependencies (if needed)
# pip install modelbest_sdk-0.2.5.7-py3-none-any.whl

# 2.2 Get Distributed Environment Parameters
# Automatically detect GPUs on current node
GPUS_PER_NODE=\$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l)

# Read world info from the Slurm environment variables.
# NOTE: SLURM_NTASKS and SLURM_PROCID are forwarded into the container via
# "docker run -e" in the srun command below; the container does not inherit them otherwise.
WORLD_SIZE=\${SLURM_NTASKS}
RANK=\${SLURM_PROCID}
# Master node address (first hostname in the node list). This is resolved on the host
# when run_training.sh is generated, because scontrol and SLURM_NODELIST are
# typically not available inside the container.
MASTER_ADDR=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1)
MASTER_PORT=6420

echo "Node Info: Rank \${RANK}/\${WORLD_SIZE} | Master: \${MASTER_ADDR}:\${MASTER_PORT} | GPUs: \${GPUS_PER_NODE}"

# 2.3 Model & Data Configuration
TOKENIZER_MODEL=${docker_workspace}/PLM/DeepSeek-R1-Distill-Qwen-1.5B
# Data path format: "<weight> <path-prefix>" (space-separated; multiple pairs may be listed)
DATA_PATH="1.0 ${docker_workspace}/Datasets/OpenR1-Math-prefix/OpenR1-Math-220k_deepseek_qwen-prefix"

# 2.4 Build PyTorch Distributed Args (torchrun)
DISTRIBUTED_ARGS=(
    --nproc_per_node \$GPUS_PER_NODE
    --nnodes \$WORLD_SIZE
    --node_rank \$RANK
    --master_addr \$MASTER_ADDR
    --master_port \$MASTER_PORT
    --rdzv_backend c10d
    --rdzv_endpoint \$MASTER_ADDR:\$MASTER_PORT
)

# 2.5 Megatron Model Args (Example)
GPT_MODEL_ARGS=(
    --vocab-size 151936
    --make-vocab-size-divisible-by 1
    --num-layers 28
    --hidden-size 1536
    --num-attention-heads 16
    --seq-length 4096
    --max-position-embeddings 4096
    --untie-embeddings-and-output-weights
)

# 2.6 Training Hyperparameters
TRAINING_ARGS=(
    --micro-batch-size 4
    --global-batch-size 512
    --lr 1e-5
    --lr-decay-style cosine
    --train-iters 500000
    --weight-decay 0.1
    --clip-grad 1.0
    --log-interval 10
    --save-interval 1000
    --eval-interval 1000
    --eval-iters 10
)

# 2.7 Output Configuration
OUTPUT_ARGS=(
    --save ${docker_workspace}/${save_path}
    --load ${load_path}
    --tensorboard-dir ${docker_workspace}/${tensorboard_path}
)

# 2.8 Launch Command
torchrun \
    "\${DISTRIBUTED_ARGS[@]}" \
    pretrain_gpt.py \
    "\${GPT_MODEL_ARGS[@]}" \
    "\${TRAINING_ARGS[@]}" \
    "\${OUTPUT_ARGS[@]}" \
    --tokenizer-model \${TOKENIZER_MODEL} \
    --data-path \${DATA_PATH}  # intentionally unquoted: "weight path-prefix" must split into two arguments
RUN_EOF

# Make script executable
chmod +x run_training.sh

# ==========================================
# 3. Launch the Docker Container on Each Node
# ==========================================
echo "Starting Docker container on all nodes..."

# srun executes the following command on EVERY allocated node.
# The -e flags forward the per-task Slurm variables (world size, rank) into the container.
srun bash -c "
docker run --rm \\
  --gpus all \\
  --network host \\
  --ipc=host \\
  --ulimit memlock=-1 \\
  --ulimit stack=67108864 \\
  --privileged=true \\
  -e SLURM_NTASKS -e SLURM_PROCID \\
  -v ${host_workspace}:${docker_workspace} \\
  -w ${docker_workspace} \\
  ${docker_image} \\
  bash run_training.sh
"

3. Core Mechanism Analysis

3.1 Path Mapping Strategy

To keep data consistent between the container and the host, volume mounting is used. It is recommended to manage the paths through a unified pair of variables:

  • host_workspace: Storage path on the physical machine (e.g., Lustre/NFS mount point).
  • docker_workspace: Standard working directory inside the container.
  • Tip: Set both load_path and save_path as subdirectories of docker_workspace so that a single mount grants access to all resources.
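
As a concrete example of this mapping, with the variables defined in the template a checkpoint written by the container is immediately visible on the host:

bash
# -v ${host_workspace}:${docker_workspace} makes these two paths refer to the same directory:
#   container: /workspace/files/data/Distill-Qwen-1.5B-OpenR1-Math-94k/checkpoints/
#   host:      /home/test/test06/qzk/data/Distill-Qwen-1.5B-OpenR1-Math-94k/checkpoints/
ls /home/test/test06/qzk/data/Distill-Qwen-1.5B-OpenR1-Math-94k/checkpoints/   # run on the host to inspect checkpoints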

3.2 Distributed Environment Injection

PyTorch Elastic (torchrun) needs to know the cluster topology. This information is retrieved from Slurm in the sbatch script and handed to the container, either baked into run_training.sh at generation time or forwarded via docker run -e:

Parameter      | Source                      | Description
WORLD_SIZE     | $SLURM_NTASKS               | Total number of processes (equal to the node count here, one task per node)
RANK           | $SLURM_PROCID               | Global index of the current node (0, 1, ...)
MASTER_ADDR    | scontrol show hostname ...  | The first node's hostname, used as the master/rendezvous endpoint
GPUS_PER_NODE  | nvidia-smi                  | Detected dynamically inside the container to avoid hardcoding
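
A quick way to confirm these values before launching a long run is to print them from a short srun command with the same allocation shape (a minimal sketch; adjust partition/account to your cluster, and g[01-02] is only an illustrative nodelist):

bash
# One task per node: each prints its hostname, rank, world size, and GPU count
srun --partition=TEST1 --account=test1267 --nodes=2 --ntasks-per-node=1 --gres=gpu:8 \
  bash -c 'echo "host=$(hostname) rank=$SLURM_PROCID/$SLURM_NTASKS gpus=$(nvidia-smi -L | wc -l)"'

# Hostlist expansion used for MASTER_ADDR: the first entry becomes the master
scontrol show hostname g[01-02] | head -n 1    # -> g01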

3.3 Docker Key Parameters

To achieve near bare-metal performance, the Docker launch parameters are critical:

  • --gpus all: Pass through all GPU devices.
  • --network host: Use the host network stack to avoid NAT overhead (Critical for IB/RoCE).
  • --ipc=host: Share host shared memory to prevent DataLoader deadlocks.
  • --ulimit memlock=-1: Remove memory lock limits to support RDMA Pin Memory.
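
To check that these flags actually take effect, a one-off container run on a compute node can be used (a sketch; each command only reports the corresponding setting):

bash
docker run --rm --gpus all --network host --ipc=host --ulimit memlock=-1 \
  megatron_pytorch:latest bash -c '
    nvidia-smi -L      # all GPUs on the node should be listed
    ulimit -l          # should print "unlimited" (memlock)
    df -h /dev/shm     # with --ipc=host this matches the host shared-memory size
  '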

4. Best Practices & Notes

Debugging Advice

For the first submission, set #SBATCH --time=00:10:00 and use srun --pty ... for an interactive test to ensure the Docker image can be pulled and paths are mounted correctly.
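
Following this advice, an interactive smoke test could look like the following (a sketch that reuses the partition, account, image, and paths from the template above):

bash
# Grab one node interactively for 10 minutes
srun --partition=TEST1 --account=test1267 --nodes=1 --gres=gpu:8 \
     --cpus-per-task=64 --time=00:10:00 --pty bash

# Inside the interactive shell: check the image, GPU passthrough, and the mount
docker image ls megatron_pytorch
docker run --rm --gpus all -v /home/test/test06/qzk/:/workspace/files \
  megatron_pytorch:latest bash -c 'nvidia-smi -L && ls /workspace/files'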

  1. Image Caching: On large clusters without internet access, make sure the docker_image is present on every compute node beforehand (docker load from a tarball, or docker pull from a local registry such as Harbor).
  2. Permissions: If the cluster does not allow regular users to run docker directly, ask the administrators to configure user namespaces, or switch to Singularity/Apptainer, the de facto standard in HPC (see the sketch after this list).
  3. Log Management: The generated script uses set -ex, so every command it runs is traced in the logs (set -x writes the trace to stderr, i.e. error_%j.log). If training hangs, check error_%j.log for NCCL errors.
  4. Local Packages: If local .whl packages are installed (e.g., modelbest_sdk), make sure the file actually exists under host_workspace; otherwise, pip install inside the container will fail.
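
For point 2 above, a rough Apptainer equivalent of the docker run launch would be the following sketch. It assumes the image has already been converted to a megatron_pytorch.sif file; Apptainer uses the host network and IPC by default and passes the host environment (including the Slurm variables) into the container, so no -e flags are needed:

bash
srun apptainer exec --nv \
     --bind /home/test/test06/qzk/:/workspace/files \
     --pwd /workspace/files \
     megatron_pytorch.sif bash run_training.sh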

AI-HPC Organization