NVIDIA PyTorch Docker Deployment Guide

Abstract: This document details the process of deploying the official NVIDIA PyTorch container on GPU nodes. It covers startup parameters, SSH configuration, image distribution, and setting up a VSCode remote development environment.

1. Pulling the Image

1.1 NGC Catalog

NVIDIA GPU Cloud (NGC) provides highly optimized deep learning framework images pre-installed with CUDA, cuDNN, NCCL, etc.

1.2 Pull Command

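Each NGC release is built against a specific CUDA version and requires a minimum host driver version. Check the release notes for the tag you plan to pull and confirm that your host driver is compatible.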

bash
# Pull version 24.07 (PyTorch 2.x, CUDA 12.x)
docker pull nvcr.io/nvidia/pytorch:24.07-py3
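One way to check the host side before pulling is to read the driver version from nvidia-smi and compare its major number with the minimum listed in the release notes. The sketch below is illustrative only: the function names are hypothetical, and the default threshold of 555 is an assumed placeholder that should be replaced with the value NVIDIA documents for your tag.

```shell
# Sketch: compare the host driver's major version with a required minimum.
# MIN_DRIVER_MAJOR is an assumed placeholder; take the real value from the
# release notes of the image tag you intend to pull.
MIN_DRIVER_MAJOR="${MIN_DRIVER_MAJOR:-555}"

# Extract the major version from a string such as "555.42.02".
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed if the driver version ($1) meets the minimum major version ($2).
driver_ok() {
  [ "$(driver_major "$1")" -ge "$2" ] 2>/dev/null
}

# Query the driver only when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_ok "$version" "$MIN_DRIVER_MAJOR"; then
    echo "Driver $version meets the assumed minimum ($MIN_DRIVER_MAJOR)."
  else
    echo "Driver '$version' is older than the assumed minimum ($MIN_DRIVER_MAJOR) or could not be read."
  fi
else
  echo "nvidia-smi not found; cannot check the driver version here."
fi
```

Comparing only the major number is a simplification; some releases also depend on the minor version or on a compatibility package, so treat a passing check as a hint rather than a guarantee.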

2. Running the Container

To make full use of the hardware (GPUs, InfiniBand network, host memory), start the container with the flags below.

bash
docker run -it \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --privileged=true \
  --network host \
  --shm-size=20G \
  --name pytorch_dev \
  nvcr.io/nvidia/pytorch:24.07-py3

Parameter Explanation

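| Flag | Purpose | Necessity |
| --- | --- | --- |
| --gpus all | Pass through all GPUs | Critical. Enables CUDA access. |
| --ipc=host | Share the host IPC namespace | Critical. DDP and multi-worker DataLoaders exchange data through shared memory. |
| --ulimit memlock=-1 | Unlimited locked memory | Critical. Required for IB RDMA (pinned memory). |
| --ulimit stack=... | Increase the stack limit (67108864 bytes = 64 MiB in the command above) | Prevents stack overflow in deep recursion. |
| --network host | Host networking | Simplifies SSH access and avoids Docker NAT overhead. |
| --shm-size=20G | Shared memory size | The default of 64 MB is too small and causes DataLoader "Bus error" crashes. With --ipc=host the container already uses the host's /dev/shm, so this value takes effect only when --ipc=host is omitted. |
| --privileged=true | Privileged mode | Grants access to host devices and driver files. It also removes most container isolation, so use it only on trusted nodes. |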
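Once the container is running, the effect of these flags can be checked from inside it. The following is a minimal sketch; the function name `report_limits` is hypothetical, and it only reads values, so it changes nothing.

```shell
# Sketch: print the limits that the docker run flags are meant to set.
report_limits() {
  # With --ulimit memlock=-1 this should read "unlimited".
  echo "memlock: $(ulimit -l)"
  # ulimit -s reports KiB, so 67108864 bytes should appear as 65536.
  echo "stack: $(ulimit -s)"
  # Size of the shared-memory filesystem used by DataLoader workers.
  if [ -d /dev/shm ]; then
    echo "shm: $(df -h /dev/shm | tail -n 1)"
  else
    echo "shm: /dev/shm not present"
  fi
  # List the GPUs visible in this environment, if the tool is available.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
  else
    echo "gpus: nvidia-smi not found"
  fi
}

report_limits
```

If memlock shows a small number or the shm line reports 64M, the container was most likely started without the corresponding flag.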

3. Container Configuration

After entering the container, install SSH for remote access.

3.1 Install Tools

bash
apt-get update
apt-get install -y iproute2 net-tools inetutils-ping openssh-server vim

3.2 Configure SSH

  1. Edit Config: Set the following in /etc/ssh/sshd_config.

    text
    # Change Port (Avoid conflict with host 22 if using host network)
    Port 20240
    # Allow Root Login
    PermitRootLogin yes
  2. Set Password:

    bash
    passwd root
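    # Enter a strong password when prompted. With --network host the SSH port
    # is reachable on the host's network, so avoid trivial values such as 123456.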
  3. Start Service:

    bash
    service ssh restart
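The manual edit in step 1 can also be scripted, which helps when the same setup is repeated on several nodes. The sketch below is one possible approach, not the guide's own method: `configure_sshd` is a hypothetical helper, and it assumes the config file has no active Match blocks at the end, because appended directives would otherwise land inside them.

```shell
# Sketch: set Port and PermitRootLogin in an sshd_config-style file.
# Arguments: $1 = config file path, $2 = port number.
# Assumption: the file contains no active "Match" blocks.
configure_sshd() {
  conf="$1"
  port="$2"
  # Drop existing (possibly commented-out) Port and PermitRootLogin lines.
  sed -E '/^[[:space:]]*#?[[:space:]]*(Port|PermitRootLogin)[[:space:]]/d' \
    "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
  # Append the desired values.
  printf 'Port %s\nPermitRootLogin yes\n' "$port" >> "$conf"
}
```

Inside the container this would be called as `configure_sshd /etc/ssh/sshd_config 20240`. Running `sshd -t` afterwards checks the file for syntax errors before the service is restarted.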

3.3 Verify Access

From the host or another machine:

bash
ssh root@<HOST_IP> -p 20240

4. Mounting Volumes

Map host directories to the container to persist code and datasets.

bash
docker run -it ... \
  -v /home/user/project:/workspace/project \
  -v /data/datasets:/workspace/data \
  nvcr.io/nvidia/pytorch:24.07-py3
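If a host path given to -v does not exist, Docker creates it as an empty directory, which can hide a typo until training fails to find its data. A small guard such as the one below can catch that earlier. It is only a sketch: `mount_flag` is a hypothetical helper, it assumes paths without spaces, and the optional `ro` mode makes the mount read-only.

```shell
# Sketch: emit a -v flag only if the host directory exists.
# Arguments: $1 = host path, $2 = container path, $3 = optional mode (e.g. ro).
# Assumption: paths contain no spaces, since the output is used unquoted.
mount_flag() {
  if [ ! -d "$1" ]; then
    echo "missing host directory: $1" >&2
    return 1
  fi
  if [ -n "$3" ]; then
    printf '%s %s:%s:%s\n' -v "$1" "$2" "$3"
  else
    printf '%s %s:%s\n' -v "$1" "$2"
  fi
}
```

The output can be spliced into the command above in place of a literal flag, for example `$(mount_flag /data/datasets /workspace/data ro)`, so that a missing directory is reported on stderr instead of being created silently.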

5. Image Persistence

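Save your environment after installing custom libraries (pip install ...). Note that docker commit captures changes to the container's own filesystem, but not the contents of mounted volumes.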

5.1 Commit

bash
# Get Container ID
docker ps

# Commit to new image
docker commit <Container_ID> pytorch24:custom-v1

5.2 Export (Save)

Export the image to a tar archive for offline distribution.

bash
docker save -o pytorch24-custom.tar pytorch24:custom-v1

5.3 Import (Load)

On the target machine:

bash
docker load -i pytorch24-custom.tar
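Image archives are often many gigabytes, and a truncated copy may only fail partway through docker load. Shipping a checksum with the archive is one way to detect that first. The helpers below are a hypothetical sketch that assumes the coreutils `sha256sum` tool is available on both machines.

```shell
# Sketch: write and verify a SHA-256 checksum next to an image archive.
# Argument: $1 = path to the archive (e.g. pytorch24-custom.tar).
write_checksum() {
  # Work in the archive's directory so the .sha256 file stores a bare filename.
  ( cd "$(dirname "$1")" && \
    sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeeds only if the archive still matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && \
    sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}
```

A typical flow would be `write_checksum pytorch24-custom.tar` after docker save, copying both files to the target, and running `verify_checksum pytorch24-custom.tar` before docker load.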

6. VSCode Remote Debugging

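With the Remote - SSH extension, VSCode connects to the SSH server configured in Section 3, so you can edit and debug code inside the container as if it were local.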

  1. Install Extension: Install Remote - SSH in VSCode.
  2. Configure Host: Edit your SSH config file (typically ~/.ssh/config) on the machine running VSCode.
    text
    Host gpu-container
        HostName <HOST_IP>
        Port 20240
        User root
  3. Connect: Click >< (bottom left) -> Connect to Host -> gpu-container.
  4. Develop: You can now edit files and debug Python code running inside the container directly.
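The Host entry from step 2 can also be added from a script, for instance when onboarding several developers. The sketch below is a hypothetical helper, `add_ssh_host`, that appends the entry only if the alias is not already defined; the IP address in the usage note is a made-up placeholder.

```shell
# Sketch: append a Host entry to an SSH client config unless the alias exists.
# Arguments: $1 = config file, $2 = alias, $3 = host IP or name, $4 = port.
add_ssh_host() {
  cfg="$1"
  alias_name="$2"
  host="$3"
  port="$4"
  # Create the file if it does not exist yet.
  touch "$cfg"
  # Skip the append when an identical alias is already defined.
  if grep -q "^Host $alias_name\$" "$cfg"; then
    echo "Host $alias_name already present in $cfg"
    return 0
  fi
  printf '\nHost %s\n    HostName %s\n    Port %s\n    User root\n' \
    "$alias_name" "$host" "$port" >> "$cfg"
}
```

For example, `add_ssh_host ~/.ssh/config gpu-container 10.0.0.5 20240` would produce the same block as shown in step 2, and running it a second time leaves the file unchanged.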

7. Summary

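| Step | Command | Key Note |
| --- | --- | --- |
| Pull | docker pull | Use official nvcr.io images. |
| Run | docker run | Always use --ipc=host, or a large --shm-size if the host IPC namespace is not shared. |
| Connect | ssh -p <port> | Install openssh-server inside the container. |
| Debug | VSCode Remote | Edit and debug directly inside the container. |
| Distribute | commit / save | Snapshot the environment for the cluster. |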

AI-HPC Organization