GPU Driver & CUDA Installation Guide

Abstract: This guide provides the Standard Operating Procedure (SOP) for deploying NVIDIA GPU environments on Linux servers. It covers hardware checks, dependency installation, disabling Nouveau, driver installation, and CUDA configuration, including specific steps for SXM Fabric Manager.

1. Preparation

1.1 Hardware & System Check

Ensure the GPU is physically connected and detected.

bash

# Check PCI devices
lspci | grep -i nvidia

If the command prints nothing, check that the card is seated in its slot and that its power connectors are attached.
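
To make the check above scriptable, here is a minimal sketch that counts NVIDIA entries in `lspci` output. The function name `count_nvidia_devices` is hypothetical. It reads from standard input, so it works on a saved listing as well as on live output. The count may include non-GPU NVIDIA functions (for example audio or bridge devices), so treat it as a rough indicator.

```shell
# Count PCI entries that mention NVIDIA (case-insensitive).
# Reads lspci-style text on stdin and prints the number of matching lines.
count_nvidia_devices() {
    # grep exits with status 1 when nothing matches, but still prints 0
    grep -ic nvidia || true
}

# Usage on a live host (commented out so the sketch has no side effects):
#   lspci | count_nvidia_devices
```

A result of 0 corresponds to the "no output" case described above.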

1.2 Download Software

Download the latest versions from NVIDIA:

Driver: NVIDIA Driver Downloads (Linux 64-bit)
CUDA: CUDA Toolkit Archive (Runfile (local) recommended)

1.3 Cleanup

Uninstall old versions if they exist.

bash

# Uninstall Driver
/usr/bin/nvidia-uninstall -s

# Uninstall CUDA
/usr/local/cuda-X.Y/bin/cuda-uninstaller

1.4 Install Dependencies

Install headers and toolchains required for compilation.

CentOS 7:

bash

yum install -y gcc gcc-c++ tar make bzip2 pkgconfig \
    libglvnd-devel elfutils-libelf-devel \
    kernel-devel-$(uname -r) kernel-headers-$(uname -r)

Ubuntu:

bash

apt-get update
apt-get install -y gcc g++ tar make pkg-config build-essential \
    libglvnd-dev linux-headers-$(uname -r)

1.5 Disable Nouveau

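Nouveau is the open-source NVIDIA driver bundled with most distributions. It must be disabled before installing the official proprietary driver, because the two kernel modules cannot drive the same GPU at the same time.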

Check Status:
bash
```
lsmod | grep nouveau
```
Blacklist:
CentOS/RHEL: /usr/lib/modprobe.d/blacklist-nouveau.conf
Ubuntu: /etc/modprobe.d/blacklist.conf
bash
```
blacklist nouveau
options nouveau modeset=0
```

Rebuild Initramfs & Reboot:

bash

# CentOS
dracut -f
# Ubuntu
update-initramfs -u

reboot
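
The blacklist step can also be scripted. The following is a minimal sketch, assuming you pass the file path for your distribution yourself. The function name `write_nouveau_blacklist` is hypothetical. It appends each directive only if it is not already present, so it is safe to run more than once.

```shell
# Append the two Nouveau blacklist directives to the given file,
# skipping any directive that is already there.
# $1: target file, e.g. /etc/modprobe.d/blacklist.conf (needs root on a real host)
write_nouveau_blacklist() {
    conf=$1
    for line in 'blacklist nouveau' 'options nouveau modeset=0'; do
        # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
        grep -qxF "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}

# Usage (commented out so the sketch has no side effects):
#   write_nouveau_blacklist /etc/modprobe.d/blacklist.conf
```

The initramfs rebuild and the reboot are still required afterwards for the change to take effect.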

2. GPU Driver Installation

Runlevel

Installation must be performed in Text Mode (Runlevel 3) with the X-Server stopped.
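
On systemd-based hosts the SysV runlevels are exposed as targets. The following is a minimal sketch of that mapping, assuming a systemd host. The helper name `runlevel_to_target` is hypothetical, and only the two runlevels relevant here are covered.

```shell
# Map a SysV runlevel number to the matching systemd target name.
runlevel_to_target() {
    case $1 in
        3) echo multi-user.target ;;   # text mode, no X server
        5) echo graphical.target ;;    # graphical login
        *) echo "unsupported runlevel: $1" >&2; return 1 ;;
    esac
}

# Usage (commented out; needs root and stops the graphical session):
#   systemctl isolate "$(runlevel_to_target 3)"
```

`systemctl isolate` switches the running system only. The default target after the next reboot is unchanged.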

2.1 Install Driver

The commands below assume the package is named NVIDIA-Linux-x86_64-xxx.run.

bash

chmod +x NVIDIA-Linux-x86_64-*.run

# -a: Accept license, -s: Silent, -Z: Disable Nouveau, --no-opengl-files: Avoid UI conflicts
./NVIDIA-Linux-x86_64-*.run -a -s -Z --no-opengl-files

2.2 Persistence Mode

Persistence mode keeps the driver loaded and the GPU initialized even when no process is using it, which removes the re-initialization delay at the start of each job.

bash

# Method A: System Service (Recommended)
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced

# Method B: Command Line (Temporary)
nvidia-smi -pm 1

2.3 Fabric Manager (SXM Required)

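For SXM systems that connect GPUs through NVSwitch (such as HGX A100/H100/H800 boards), Fabric Manager is mandatory. Without it, the NVLink fabric is not configured and GPU-to-GPU NVLink communication will not work.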

bash

# Install (RPM example, version must match driver exactly)
rpm -ivh nvidia-fabricmanager-*.rpm

# Start Service
systemctl enable --now nvidia-fabricmanager
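
Because the Fabric Manager version must match the driver exactly, it can help to compare the two before installing. The following is a minimal sketch, assuming the driver runfile follows the NVIDIA-Linux-x86_64-<version>.run naming used in 2.1. Both function names are hypothetical, and the RPM query in the usage comment is an assumption about how you obtain the package version.

```shell
# Extract the driver version from a runfile path,
# e.g. /tmp/NVIDIA-Linux-x86_64-<version>.run -> <version>
driver_version_from_runfile() {
    name=${1##*/}                       # drop any leading directory
    name=${name#NVIDIA-Linux-x86_64-}   # drop the fixed prefix
    printf '%s\n' "${name%.run}"        # drop the .run suffix
}

# Succeed only when both version strings are non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (commented out so the sketch has no side effects):
#   drv=$(driver_version_from_runfile NVIDIA-Linux-x86_64-*.run)
#   fm=$(rpm -qp --queryformat '%{VERSION}' nvidia-fabricmanager-*.rpm)
#   versions_match "$drv" "$fm" || echo "mismatch: driver $drv, Fabric Manager $fm" >&2
```

The comparison is a plain string match on purpose, since a near match (same branch, different patch level) does not satisfy the "exactly" requirement.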

3. CUDA Toolkit Installation

3.1 Install CUDA

bash

chmod +x cuda_*.run

# --no-opengl-libs: Skip OpenGL libs
# Note: Uncheck "Driver" in the interactive menu (since it's installed in Step 2)
./cuda_*.run --no-opengl-libs
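
For unattended deployments, the interactive menu can be skipped. The following is a minimal sketch, assuming your runfile accepts the `--silent` and `--toolkit` options (check `--help` for your release). The `run` wrapper and the function name `install_cuda_toolkit_only` are hypothetical. By default the wrapper only prints the command, so nothing is installed until you set `DRY_RUN=0`.

```shell
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Install only the toolkit: no bundled driver, no interactive menu.
# $1: path to the CUDA runfile
install_cuda_toolkit_only() {
    run sh "$1" --silent --toolkit --no-opengl-libs
}

# Usage (commented out so the sketch has no side effects):
#   DRY_RUN=0 install_cuda_toolkit_only ./cuda_*.run
```

Selecting `--toolkit` alone has the same effect as unchecking "Driver" in the menu, since the driver was already installed in Step 2.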

3.2 Environment Variables

Edit /etc/profile or ~/.bashrc:

bash

export PATH=/usr/local/cuda/bin:$PATH
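# ${VAR:+:$VAR} avoids a trailing ':' (an empty entry means the current directory) when the variable is unset
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}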

Apply and verify:

bash

source /etc/profile   # or: source ~/.bashrc, whichever file you edited
nvcc -V

4. Verification

4.1 Commands

Component	Command	Expected Result
Driver Status	`nvidia-smi`	Shows GPU list, VRAM, Power
CUDA Version	`nvcc -V`	Shows version number
Fabric Manager	`systemctl status nvidia-fabricmanager`	Active (Running)
Persistence	`systemctl status nvidia-persistenced`	Active (Running)
Nouveau	`lsmod | grep nouveau`	No output
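
The checks in the table can be run in one pass. The following is a minimal sketch of a pass/fail wrapper. The function name `check` is hypothetical, and it only looks at each command's exit status, not at its output.

```shell
# Run one check quietly and print PASS or FAIL with its label.
# $1: label; remaining arguments: the command to run
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# Usage on a GPU host (commented out so the sketch has no side effects):
#   check "Driver"         nvidia-smi
#   check "CUDA"           nvcc -V
#   check "Fabric Manager" systemctl is-active --quiet nvidia-fabricmanager
#   check "Persistence"    systemctl is-active --quiet nvidia-persistenced
#   check "Nouveau off"    sh -c '! lsmod | grep -q "^nouveau "'
```

On systems without NVSwitch the Fabric Manager line is expected to fail and can be dropped.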

4.2 Troubleshooting

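rmmod: ERROR: Module nouveau is in use: Nouveau is still loaded. Confirm that the blacklist file is in place, that the initramfs was rebuilt, and that the system was rebooted into Init 3.
Install Failed: Check /var/log/nvidia-installer.log. The failure is often caused by a mismatch between the installed kernel-devel version and the running kernel (uname -r).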
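
The kernel-devel mismatch can be checked before running the installer. The following is a minimal sketch, assuming the distribution links /lib/modules/<release>/build to the matching kernel source tree, as the CentOS and Ubuntu header packages normally do. The function name `headers_present` is hypothetical.

```shell
# Succeed when kernel build files exist for the given kernel release.
# $1: kernel release (default: the running kernel)
# $2: modules root (default: /lib/modules)
headers_present() {
    rel=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    # A dangling "build" symlink means the headers for this release are missing
    [ -e "$root/$rel/build/Makefile" ]
}

# Usage (commented out so the sketch has no side effects):
#   headers_present || echo "no kernel headers for $(uname -r)" >&2
```

If the check fails after a kernel update, either install the header package for the running kernel or reboot into the kernel that matches the installed headers.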
