NVIDIA PyTorch Docker Deployment Guide

Abstract: This document details the process of deploying the official NVIDIA PyTorch container on GPU nodes. It covers startup parameters, SSH configuration, image distribution, and setting up a VSCode remote development environment.

1. Pulling the Image

1.1 NGC Catalog

NVIDIA GPU Cloud (NGC) provides highly optimized deep learning framework images pre-installed with CUDA, cuDNN, NCCL, etc.

URL: NVIDIA NGC Catalog - PyTorch

1.2 Pull Command

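Ensure the image version is compatible with your host driver. Each NGC release requires a minimum NVIDIA driver version, which is listed in the release notes for that container version.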

bash

# Pull version 24.07 (PyTorch 2.x, CUDA 12.x)
docker pull nvcr.io/nvidia/pytorch:24.07-py3
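
The driver compatibility check can be scripted before pulling. The following is a minimal sketch, not an official NVIDIA tool: `version_ge` and `MIN_DRIVER` are hypothetical names, and the value `535.86.10` is only a placeholder that must be replaced with the minimum from the release notes of the image you use. It assumes GNU `sort -V` is available on the host.

```bash
#!/usr/bin/env bash
# Placeholder minimum driver version: replace with the value listed in the
# release notes for your image tag (this number is an assumption).
MIN_DRIVER="535.86.10"

# version_ge A B: succeed if version A >= version B (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Query the driver version reported for the first GPU.
  host_driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if version_ge "$host_driver" "$MIN_DRIVER"; then
    echo "Driver $host_driver meets the minimum $MIN_DRIVER"
  else
    echo "Driver $host_driver is older than $MIN_DRIVER; upgrade the driver or pick an older image tag"
  fi
else
  echo "nvidia-smi not found; is the NVIDIA driver installed on this host?"
fi
```

Comparing with `sort -V` rather than string comparison matters because driver versions such as `535.104.05` and `535.86.10` do not sort correctly as plain text.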

2. Running the Container

To give the container full access to the GPUs, the InfiniBand (IB) network, and shared memory, start it with the following flags.

bash

docker run -it \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --privileged=true \
  --network host \
  --shm-size=20G \
  --name pytorch_dev \
  nvcr.io/nvidia/pytorch:24.07-py3

Parameter Explanation

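Flag	Purpose	Necessity
`--gpus all`	Pass-through GPUs	Critical. Enables CUDA access.
`--ipc=host`	Share host IPC namespace	Critical. Gives DDP processes and DataLoader workers the host's shared memory (`/dev/shm`).
`--ulimit memlock=-1`	Unlimited locked memory	Critical. Required for IB RDMA, which pins (locks) memory pages.
`--ulimit stack=67108864`	64 MiB stack limit	Recommended. Prevents stack overflow in deep recursion.
`--network host`	Host networking	Recommended. No port mapping is needed for SSH, and traffic avoids the bridge/NAT overhead.
`--shm-size=20G`	Shared memory size	Only takes effect without `--ipc=host`. Docker's default of 64M is too small and causes DataLoader `Bus error`.
`--privileged=true`	Privileged mode	Optional. Grants access to all host devices and driver files (e.g. InfiniBand devices), but also disables most container isolation; use it only on trusted nodes.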
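
Once the container is running, the effect of these flags can be checked from a shell inside it. This is a minimal sketch; `report` is a hypothetical helper, and the expected values assume the exact flags shown above (`memlock=-1`, `stack=67108864`). Docker takes the stack limit in bytes while `ulimit -s` prints KiB, hence the division by 1024.

```bash
#!/usr/bin/env bash
# report LABEL ACTUAL EXPECTED: print OK/WARN and return non-zero on mismatch.
report() {
  if [ "$2" = "$3" ]; then
    echo "OK   $1 = $2"
  else
    echo "WARN $1 = $2 (expected $3)"
    return 1
  fi
}

# --ulimit memlock=-1 should show up as "unlimited".
report "memlock" "$(ulimit -l)" "unlimited" || true

# --ulimit stack=67108864 is given in bytes; `ulimit -s` prints KiB (65536).
report "stack_kib" "$(ulimit -s)" "$((67108864 / 1024))" || true

# With --ipc=host this is the host's /dev/shm.
if [ -d /dev/shm ]; then df -h /dev/shm; fi

# List the GPUs exposed by --gpus all.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L || echo "nvidia-smi failed; check --gpus and the host driver"
fi
```

A `WARN` line usually means the container was started without the corresponding flag, in which case it has to be recreated with `docker run`, since these limits cannot be changed on a running container.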

3. Container Configuration

After entering the container, install SSH for remote access.

3.1 Install Tools

bash

apt-get update
apt-get install -y iproute2 net-tools inetutils-ping openssh-server vim

3.2 Configure SSH

Edit Config: Set the following options in /etc/ssh/sshd_config.

text

# Use a non-default port: with --network host, the host's own sshd already occupies port 22
Port 20240
# Allow root to log in with a password
PermitRootLogin yes
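
These two settings can also be applied without opening an editor. The sketch below is one possible approach, not part of the original procedure: `set_option` is a hypothetical helper that rewrites an existing (possibly commented-out) option line or appends the option if it is missing. The demo runs against a scratch file so it can be tried safely.

```bash
#!/usr/bin/env bash
# set_option FILE KEY VALUE: replace an existing (possibly commented-out)
# "KEY ..." line, or append the option if the file has no such line.
set_option() {
  file="$1"; key="$2"; value="$3"
  if grep -Eq "^[#[:space:]]*${key}[[:space:]]" "$file"; then
    sed -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file" > "${file}.tmp"
    cat "${file}.tmp" > "$file"   # overwrite in place, keeping file permissions
    rm -f "${file}.tmp"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}

# Demo on a scratch file that mimics commented-out stock defaults.
demo="$(mktemp)"
printf '%s\n' '#Port 22' '#PermitRootLogin prohibit-password' > "$demo"
set_option "$demo" Port 20240
set_option "$demo" PermitRootLogin yes
cat "$demo"
rm -f "$demo"
```

To apply it for real, call `set_option /etc/ssh/sshd_config Port 20240` (and likewise for `PermitRootLogin`) inside the container, then run `sshd -t` to validate the file before restarting the service.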

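Set Password:

bash

passwd root
# Enter a strong password when prompted. With host networking the SSH port
# is reachable from the network, so avoid trivial passwords such as 123456.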

Start Service:

bash

service ssh restart

3.3 Verify Access

From the host or another machine:

bash

ssh root@<HOST_IP> -p 20240
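
If the connection is refused, it helps to confirm that sshd is actually listening on the new port. The sketch below uses `ss` from the `iproute2` package installed earlier; `has_listener` is a hypothetical helper that parses `ss -tln` output, where the fourth column is the local address and port.

```bash
#!/usr/bin/env bash
# has_listener PORT: read `ss -tln` output on stdin and succeed if some
# socket is in LISTEN state on PORT (column 4 is "Local Address:Port").
has_listener() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit !found }
  '
}

if command -v ss >/dev/null 2>&1; then
  if ss -tln | has_listener 20240; then
    echo "sshd is listening on port 20240"
  else
    echo "nothing is listening on port 20240; check sshd_config and restart ssh"
  fi
fi
```

Because the container uses host networking, the same check gives the same answer whether it is run inside the container or on the host.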

4. Mounting Volumes

Map host directories to the container to persist code and datasets.

bash

docker run -it ... \
  -v /home/user/project:/workspace/project \
  -v /data/datasets:/workspace/data \
  nvcr.io/nvidia/pytorch:24.07-py3
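
With `-v`, Docker creates a missing host path as an empty directory owned by root, so a typo in the host path fails silently. A small pre-flight check can catch this. The following is a sketch under that assumption; `mount_arg` is a hypothetical helper, and the paths are the example paths from the command above.

```bash
#!/usr/bin/env bash
# mount_arg HOST_DIR CONTAINER_DIR: print a `-v` flag, or fail if HOST_DIR
# does not exist on the host.
mount_arg() {
  if [ ! -d "$1" ]; then
    echo "missing host directory: $1" >&2
    return 1
  fi
  printf '%s %s:%s\n' "-v" "$1" "$2"
}

# The two example mappings, written as HOST:CONTAINER pairs.
for pair in /home/user/project:/workspace/project /data/datasets:/workspace/data; do
  mount_arg "${pair%%:*}" "${pair#*:}" \
    || echo "fix or create ${pair%%:*} before running docker"
done
```

Each line printed by `mount_arg` can be pasted into the `docker run` command; a failure message points at the host directory that needs attention first.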

5. Image Persistence

Save your environment after installing custom libraries (pip install ...).

5.1 Commit

bash

# Get Container ID
docker ps

# Commit to new image
docker commit <Container_ID> pytorch24:custom-v1

5.2 Export (Save)

Export the image to a tar archive for offline distribution.

bash

docker save -o pytorch24-custom.tar pytorch24:custom-v1

5.3 Import (Load)

On the target machine:

bash

docker load -i pytorch24-custom.tar
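
Image archives are typically several gigabytes, so it is worth verifying that the copy on the target machine is intact before loading it. The sketch below shows one way to do this with `sha256sum`; `write_checksum` and `verify_checksum` are hypothetical helper names, and the demo uses a small stand-in file rather than a real image archive.

```bash
#!/usr/bin/env bash
# write_checksum FILE: store FILE's SHA-256 next to it as FILE.sha256.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum FILE: succeed only if FILE still matches FILE.sha256.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" )
}

# Demo with a small stand-in file instead of a multi-gigabyte archive.
workdir="$(mktemp -d)"
echo "stand-in for pytorch24-custom.tar" > "$workdir/demo.tar"
write_checksum "$workdir/demo.tar"
verify_checksum "$workdir/demo.tar" && echo "checksum verified"
rm -rf "$workdir"
```

In practice, run `write_checksum pytorch24-custom.tar` after `docker save`, copy both files to the target machine, and run `verify_checksum pytorch24-custom.tar` before `docker load`.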

6. VSCode Remote Debugging

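With SSH running inside the container, VSCode can connect to it through the Remote - SSH extension, so editing and debugging happen directly in the container environment.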

Install Extension: Install Remote - SSH in VSCode.

Configure Host: Edit the SSH config file on your local machine (typically ~/.ssh/config).

text

Host gpu-container
    HostName <HOST_IP>
    Port 20240
    User root

Connect: Click >< (bottom left) -> Connect to Host -> gpu-container.
Develop: You can now edit files and debug Python code running inside the container directly.
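
The host entry can also be added from a terminal. The following is a minimal sketch: `add_host` is a hypothetical helper that appends the block only if the alias is not already defined, and `192.0.2.10` is a documentation-only address standing in for `<HOST_IP>`. The demo writes to a scratch file rather than a real SSH config.

```bash
#!/usr/bin/env bash
# add_host CONFIG ALIAS HOSTNAME PORT: append a Host block unless ALIAS
# is already defined in CONFIG.
add_host() {
  config="$1"; alias_name="$2"; host_name="$3"; port="$4"
  touch "$config"
  if grep -Eq "^[[:space:]]*Host[[:space:]]+${alias_name}[[:space:]]*\$" "$config"; then
    echo "Host ${alias_name} already present in ${config}"
    return 0
  fi
  printf '\nHost %s\n    HostName %s\n    Port %s\n    User root\n' \
    "$alias_name" "$host_name" "$port" >> "$config"
}

# Demo against a scratch file; 192.0.2.10 is a placeholder for <HOST_IP>.
demo_config="$(mktemp)"
add_host "$demo_config" gpu-container 192.0.2.10 20240
add_host "$demo_config" gpu-container 192.0.2.10 20240   # second call is a no-op
cat "$demo_config"
rm -f "$demo_config"
```

To use it for real, pass `"$HOME/.ssh/config"` as the first argument and the GPU node's address as the third; `ssh gpu-container` should then work from the terminal as well as from VSCode.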

7. Summary

Step	Command	Key Note
Pull	`docker pull`	Use official `nvcr.io` images.
Run	`docker run`	Use `--ipc=host`, or a large `--shm-size` without it.
Connect	`ssh -p <port>`	Install `openssh-server` inside the container.
Debug	VSCode Remote - SSH	Edit and debug code inside the container.
Distribute	`commit` / `save` / `load`	Snapshot the environment and copy it to other nodes.
