NVIDIA PyTorch Docker Deployment Guide
Abstract: This document details the process of deploying the official NVIDIA PyTorch container on GPU nodes. It covers startup parameters, SSH configuration, image distribution, and setting up a VSCode remote development environment.
1. Pulling the Image
1.1 NGC Catalog
NVIDIA GPU Cloud (NGC) provides highly optimized deep learning framework images pre-installed with CUDA, cuDNN, NCCL, etc.
1.2 Pull Command
Ensure the image version is compatible with your host driver.
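One way to check this before pulling is to compare the host driver version with the minimum listed in the release notes for the image. The sketch below assumes `nvidia-smi` is installed on the host and that GNU `sort` is available. `version_ge` and `check_driver` are hypothetical helper names, and the `MIN_DRIVER` default is only a placeholder, not a documented requirement.

```shell
#!/usr/bin/env bash
# Return success when version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder value: replace with the minimum driver from the release notes.
MIN_DRIVER="${MIN_DRIVER:-535.0}"

check_driver() {
    local current
    # Ask the driver for its version; take the first GPU's answer.
    current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if version_ge "$current" "$MIN_DRIVER"; then
        echo "driver $current >= $MIN_DRIVER"
    else
        echo "driver $current is older than $MIN_DRIVER" >&2
        return 1
    fi
}
```

Running `check_driver` on the host prints the detected version and returns a non-zero status when it is below the chosen minimum.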
# Pull version 24.07 (PyTorch 2.x, CUDA 12.x)
docker pull nvcr.io/nvidia/pytorch:24.07-py3
2. Running the Container
To make full use of the hardware (GPUs, InfiniBand network, memory), start the container with the following flags.
docker run -it \
--gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--privileged=true \
--network host \
--shm-size=20G \
--name pytorch_dev \
nvcr.io/nvidia/pytorch:24.07-py3
Parameter Explanation
| Flag | Purpose | Necessity |
|---|---|---|
| --gpus all | Pass through GPUs | Critical. Enables CUDA access. |
| --ipc=host | Share host IPC | Critical. Needed for shared-memory communication in DDP and DataLoader workers. |
| --ulimit memlock=-1 | Unlimited memlock | Critical. Required for IB RDMA (pinned memory). |
| --ulimit stack=... | Increase stack | Raises the stack limit to 64 MiB; prevents stack overflow in deep recursion. |
| --network host | Host networking | Simplifies SSH access; lowest latency. |
| --shm-size=20G | Shared memory | Docker's default of 64 MB is too small and causes DataLoader bus errors. Has no effect when --ipc=host is set, because the container then uses the host's /dev/shm. |
| --privileged=true | Privileged mode | Grants access to all host devices and driver files. Weakens container isolation, so use it only on trusted nodes. |
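To confirm that the flags took effect, the limits can be inspected from a shell inside the running container. The following is a minimal sketch, assuming bash and the values from the command above. `check_value` and `verify_container` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Compare an observed value with the value the docker run flags should produce.
check_value() {
    local label="$1" actual="$2" expected="$3"
    if [ "$actual" = "$expected" ]; then
        echo "OK   $label=$actual"
    else
        echo "WARN $label=$actual (expected $expected)"
        return 1
    fi
}

verify_container() {
    # --ulimit memlock=-1 should show up as "unlimited".
    check_value memlock "$(ulimit -l)" unlimited
    # --ulimit stack=67108864 is given in bytes; bash reports the limit in KiB.
    check_value stack_kib "$(ulimit -s)" 65536
    # Size of the shared memory mount used by DataLoader workers.
    df -h /dev/shm
    # List the GPUs visible to the container.
    nvidia-smi -L
}
```

Calling `verify_container` inside the container prints one line per limit, followed by the `/dev/shm` size and the GPU list.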
3. Container Configuration
After entering the container, install SSH for remote access.
3.1 Install Tools
apt-get update
apt-get install -y iproute2 net-tools inetutils-ping openssh-server vim
3.2 Configure SSH
Edit Config: /etc/ssh/sshd_config
# Change Port (avoid a conflict with host port 22 when using host networking)
Port 20240
# Allow Root Login
PermitRootLogin yes
Set Password:
passwd root # Set a password (choose a strong one rather than a trivial value like 123456)
Start Service:
service ssh restart
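The same edits can be applied non-interactively, which helps when the container is rebuilt often. This is a sketch only, assuming GNU `sed` and `grep`. `set_sshd_option` is a hypothetical helper that rewrites an existing line for the key, including a commented-out default, or appends the setting if no such line exists.

```shell
#!/usr/bin/env bash
# Set "KEY VALUE" in an sshd_config-style file.
# Every line starting with KEY (optionally commented out) is replaced;
# if there is no such line, the setting is appended.
set_sshd_option() {
    local file="$1" key="$2" value="$3"
    if grep -Eq "^[#[:space:]]*${key}[[:space:]]" "$file"; then
        sed -i -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}

# Example, using the values from this section:
#   set_sshd_option /etc/ssh/sshd_config Port 20240
#   set_sshd_option /etc/ssh/sshd_config PermitRootLogin yes
#   service ssh restart
```

The helper takes the file path as an argument, so it can be tried on a copy of the config first.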
3.3 Verify Access
From the host or another machine:
ssh root@<HOST_IP> -p 20240
4. Mounting Volumes
Map host directories to the container to persist code and datasets.
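With `-v`, Docker creates a missing host directory instead of failing, so a mistyped path can end up as an empty mount. A small pre-flight check, sketched here with a hypothetical helper `missing_mount_sources`, lists host paths that do not exist yet. It assumes each argument is a `SRC:DST` pair like the ones passed to `-v`.

```shell
#!/usr/bin/env bash
# Print each host path from the given SRC:DST specs that is not an existing
# directory; return non-zero if any were missing.
missing_mount_sources() {
    local spec src status=0
    for spec in "$@"; do
        src="${spec%%:*}"   # text before the first colon is the host path
        if [ ! -d "$src" ]; then
            echo "$src"
            status=1
        fi
    done
    return "$status"
}

# Example, using the paths from this section:
#   missing_mount_sources /home/user/project:/workspace/project \
#                         /data/datasets:/workspace/data
```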
docker run -it ... \
-v /home/user/project:/workspace/project \
-v /data/datasets:/workspace/data \
nvcr.io/nvidia/pytorch:24.07-py3
5. Image Persistence
Save your environment after installing custom libraries (pip install ...).
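When the exported tarball is copied between machines, a checksum can confirm it arrived intact before it is loaded. The sketch below assumes GNU coreutils (`sha256sum`). `write_checksum` and `verify_checksum` are hypothetical helpers wrapped around the export and import steps of this section.

```shell
#!/usr/bin/env bash
# Write <file>.sha256 next to the tarball (run on the source machine).
write_checksum() {
    local file="$1"
    ( cd "$(dirname "$file")" &&
      sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# Verify the tarball against <file>.sha256 (run on the target machine).
verify_checksum() {
    local file="$1"
    ( cd "$(dirname "$file")" &&
      sha256sum -c --quiet "$(basename "$file").sha256" )
}

# Example around the export and import steps:
#   write_checksum pytorch24-custom.tar      # after docker save
#   verify_checksum pytorch24-custom.tar     # before docker load
```

The checksum file stores only the base name of the tarball, so both files can be copied to any directory on the target machine.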
5.1 Commit
# Get Container ID
docker ps
# Commit to new image
docker commit <Container_ID> pytorch24:custom-v1
5.2 Export (Save)
For offline distribution.
docker save -o pytorch24-custom.tar pytorch24:custom-v1
5.3 Import (Load)
On the target machine:
docker load -i pytorch24-custom.tar
6. VSCode Remote Debugging
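The VSCode Remote - SSH extension connects to the SSH server configured in Section 3, so you can edit and debug code inside the container from your local editor.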
- Install Extension: Install Remote - SSH in VSCode.
- Configure Host: Edit your SSH config file.
  Host gpu-container
      HostName <HOST_IP>
      Port 20240
      User root
- Connect: Click >< (bottom left) -> Connect to Host -> gpu-container.
- Develop: You can now edit files and debug Python code running inside the container directly.
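The host entry from the steps above can also be generated from the command line. This is a minimal sketch, assuming the OpenSSH client. `ssh_host_block` is a hypothetical helper, and its arguments stand for the alias, host address, port, and user chosen earlier.

```shell
#!/usr/bin/env bash
# Print an OpenSSH client config block for the container's SSH server.
ssh_host_block() {
    local name="$1" host="$2" port="$3" user="$4"
    printf 'Host %s\n    HostName %s\n    Port %s\n    User %s\n' \
        "$name" "$host" "$port" "$user"
}

# Example: append the entry, then check the connection before opening VSCode.
#   ssh_host_block gpu-container <HOST_IP> 20240 root >> ~/.ssh/config
#   ssh gpu-container nvidia-smi -L
```

If the `ssh gpu-container` command works in a terminal, VSCode should offer the same alias under Connect to Host.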
7. Summary
| Step | Command | Key Note |
|---|---|---|
| Pull | docker pull | Use official nvcr.io images. |
| Run | docker run | Always use --ipc=host --shm-size. |
| Connect | ssh -p <port> | Install openssh-server inside. |
| Debug | VSCode Remote | Best developer experience. |
| Distribute | commit / save | Snapshot environment for the cluster. |
