GPU Driver & CUDA Installation Guide

Abstract: This guide provides the Standard Operating Procedure (SOP) for deploying NVIDIA GPU environments on Linux servers. It covers hardware checks, dependency installation, disabling Nouveau, driver installation, and CUDA configuration, including specific steps for SXM Fabric Manager.

1. Preparation

1.1 Hardware & System Check

Ensure the GPU is physically connected and detected.

bash
# Check PCI devices
lspci | grep -i nvidia

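If the command prints nothing, the GPU is not visible on the PCI bus. Check that the card is seated in its slot and that the power connectors are attached.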
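On multi-GPU servers it can help to compare the number of detected GPUs with the number installed. The sketch below is one way to do that. It assumes GPUs are listed under the "3D controller" or "VGA compatible controller" class in lspci output, which is typical but not guaranteed on every platform. The function name count_nvidia_gpus is illustrative.

```bash
#!/usr/bin/env bash
# Count NVIDIA GPUs in lspci-style text read from stdin.
# Assumption: GPUs appear as "3D controller" or "VGA compatible controller";
# NVIDIA audio functions and bridges are deliberately not counted.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -Eci '3D controller|VGA compatible controller' || true
}

# Only query the bus when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA GPUs detected: $(lspci | count_nvidia_gpus)"
fi
```

If the count is lower than the number of cards installed, inspect the missing slot first.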

1.2 Download Software

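Download the driver (.run), the CUDA Toolkit (.run), and, for SXM systems, the Fabric Manager package from NVIDIA. The Fabric Manager version must match the driver version exactly.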

1.3 Cleanup

Uninstall old versions if they exist.

bash
# Uninstall Driver
/usr/bin/nvidia-uninstall -s

# Uninstall CUDA
/usr/local/cuda-X.Y/bin/cuda-uninstaller

1.4 Install Dependencies

Install headers and toolchains required for compilation.

CentOS 7:

bash
yum install -y gcc gcc-c++ tar make bzip2 pkgconfig \
    libglvnd-devel elfutils-libelf-devel \
    kernel-devel-$(uname -r) kernel-headers-$(uname -r)

Ubuntu:

bash
apt-get update
apt-get install -y gcc g++ tar make pkg-config build-essential \
    libglvnd-dev linux-headers-$(uname -r)
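The driver installer compiles a kernel module, so the headers must match the running kernel. A minimal sketch of a pre-flight check follows. It assumes the usual layout in which /lib/modules/<release>/build points at the installed headers. The function name headers_present and its optional second argument are illustrative.

```bash
#!/usr/bin/env bash
# Return 0 if a kernel build tree exists for the given kernel release.
# $1: kernel release (for example the output of uname -r)
# $2: modules root, defaults to /lib/modules (overridable for testing)
headers_present() {
    local krel=$1 modroot=${2:-/lib/modules}
    # -e follows symlinks, so a dangling "build" link counts as missing.
    [ -e "$modroot/$krel/build" ]
}

krel=$(uname -r)
if headers_present "$krel"; then
    echo "Kernel build tree found for $krel"
else
    echo "No kernel build tree for $krel: install the matching kernel-devel or linux-headers package"
fi
```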

1.5 Disable Nouveau

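Nouveau is the open-source NVIDIA driver shipped with most distributions. It must be disabled before the proprietary driver is installed, because the two drivers cannot control the GPU at the same time.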

  1. Check Status:
    bash
    lsmod | grep nouveau
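  2. Blacklist: Add the two lines below to the modprobe blacklist file. They are modprobe configuration, not shell commands.
    CentOS/RHEL: /usr/lib/modprobe.d/blacklist-nouveau.conf
    Ubuntu: /etc/modprobe.d/blacklist.conf
    conf
    blacklist nouveau
    options nouveau modeset=0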
  3. Rebuild Initramfs & Reboot:
    bash
    # CentOS
    dracut -f
    # Ubuntu
    update-initramfs -u
    
    reboot
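The blacklist step can be scripted so that running it twice does not duplicate entries. This is a sketch only: the helper names ensure_line and write_nouveau_blacklist are hypothetical, and the target path is passed in by the caller rather than hard-coded.

```bash
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    local file=$1 line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Write both Nouveau blacklist entries to the given modprobe config file.
write_nouveau_blacklist() {
    local file=$1
    ensure_line "$file" 'blacklist nouveau'
    ensure_line "$file" 'options nouveau modeset=0'
}

# Example (run as root), using the Ubuntu path from step 2:
# write_nouveau_blacklist /etc/modprobe.d/blacklist.conf
```

The initramfs still has to be rebuilt and the machine rebooted afterwards, as in step 3.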

2. GPU Driver Installation

Runlevel

Run the installer in text mode (runlevel 3, multi-user.target on systemd) with the X server stopped. The installer refuses to proceed while an X server is running.
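On systemd-based distributions, runlevel 3 corresponds to multi-user.target. The sketch below checks the boot default and prints the command that switches the running system to text mode. It inspects only the default target, not what is currently active, and the function name is_text_mode_target is illustrative.

```bash
#!/usr/bin/env bash
# Return 0 if the given systemd target is a text-mode (runlevel 3) target.
is_text_mode_target() {
    case "$1" in
        multi-user.target|runlevel3.target) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to "unknown" when systemctl is unavailable.
default_target=$(systemctl get-default 2>/dev/null || echo unknown)
if is_text_mode_target "$default_target"; then
    echo "Default target is $default_target (text mode)"
else
    echo "Default target is $default_target; switch with: systemctl isolate multi-user.target"
fi
```

Note that systemctl isolate multi-user.target stops the graphical session immediately, so run it from a console or SSH session.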

2.1 Install Driver

The commands below assume the package is named NVIDIA-Linux-x86_64-xxx.run, where xxx is the driver version.

bash
chmod +x NVIDIA-Linux-x86_64-*.run

# -a: Accept license, -s: Silent, -Z: Disable Nouveau, --no-opengl-files: Avoid UI conflicts
./NVIDIA-Linux-x86_64-*.run -a -s -Z --no-opengl-files

2.2 Persistence Mode

Persistence mode keeps the driver loaded and the GPU initialized even when no process is using it. This avoids the driver reload delay at the start of each job.

bash
# Method A: System Service (Recommended)
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced

# Method B: Command Line (Temporary)
nvidia-smi -pm 1

2.3 Fabric Manager (SXM Required)

On A100/H100/H800 SXM systems whose GPUs are interconnected through NVSwitch, Fabric Manager is mandatory. Without it, the NVLink fabric is not initialized and NVLink will not function.

bash
# Install (RPM example, version must match driver exactly)
rpm -ivh nvidia-fabricmanager-*.rpm

# Start Service
systemctl enable --now nvidia-fabricmanager
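Because the Fabric Manager version must match the driver exactly, a quick comparison before starting the service can save a debugging round. In the sketch below, the driver version is read with nvidia-smi. FM_VERSION is a hypothetical variable that you set to the version of the Fabric Manager package you installed, and versions_match is an illustrative helper.

```bash
#!/usr/bin/env bash
# Return 0 only if both version strings are non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Driver version as reported for the first GPU.
    driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    # Hypothetical input: export FM_VERSION with the installed package version.
    fm_ver=${FM_VERSION:-}
    if versions_match "$driver_ver" "$fm_ver"; then
        echo "Driver and Fabric Manager versions match: $driver_ver"
    else
        echo "Version mismatch: driver=$driver_ver fabricmanager=${fm_ver:-unset}"
    fi
fi
```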

3. CUDA Toolkit Installation

3.1 Install CUDA

bash
chmod +x cuda_*.run

# --no-opengl-libs: Skip OpenGL libs
# Note: Uncheck "Driver" in the interactive menu (since it's installed in Step 2)
./cuda_*.run --no-opengl-libs

3.2 Environment Variables

Add the following lines to /etc/profile (all users) or ~/.bashrc (current user only):

bash
export PATH=/usr/local/cuda/bin:$PATH
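# ${VAR:+...} avoids a trailing colon (an empty search-path entry) when LD_LIBRARY_PATH is unset
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}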

Apply the change by sourcing whichever file you edited (the example uses /etc/profile), then verify:

bash
source /etc/profile
nvcc -V
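Sourcing the profile more than once prepends the CUDA directories again each time. One possible way to keep the variables clean is a small helper that prepends a directory only when it is not already listed. The name prepend_once is illustrative, and the paths are the ones used in 3.2.

```bash
#!/usr/bin/env bash
# Print a colon-separated list with DIR prepended, unless DIR is already in it.
# $1: current value (may be empty), $2: directory to prepend
prepend_once() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;          # already present: unchanged
        *)        printf '%s\n' "$2${1:+:$1}" ;; # prepend, no trailing colon if empty
    esac
}

PATH=$(prepend_once "$PATH" /usr/local/cuda/bin)
LD_LIBRARY_PATH=$(prepend_once "${LD_LIBRARY_PATH:-}" /usr/local/cuda/lib64)
export PATH LD_LIBRARY_PATH
```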

4. Verification

4.1 Commands

| Component | Command | Expected Result |
| --- | --- | --- |
| Driver Status | `nvidia-smi` | Shows GPU list, VRAM, Power |
| CUDA Version | `nvcc -V` | Shows version number |
| Fabric Manager | `systemctl status nvidia-fabricmanager` | Active (Running) |
| Persistence | `systemctl status nvidia-persistenced` | Active (Running) |
| Nouveau | `lsmod \| grep nouveau` | No output |
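The checks in the table can be bundled into one pass/fail report. The following is a sketch rather than a complete health check: it only looks at exit codes, and the check helper is an illustrative name. The Fabric Manager line is meaningful on SXM systems only and can be removed elsewhere.

```bash
#!/usr/bin/env bash
# Run a command quietly and report PASS or FAIL under the given label.
check() {
    local name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $name"
    else
        echo "FAIL  $name"
        return 1
    fi
}

failures=0
check "Driver (nvidia-smi)" nvidia-smi || failures=$((failures + 1))
check "CUDA compiler (nvcc -V)" nvcc -V || failures=$((failures + 1))
# SXM systems only:
check "Fabric Manager service" systemctl is-active --quiet nvidia-fabricmanager || failures=$((failures + 1))
check "Persistence daemon" systemctl is-active --quiet nvidia-persistenced || failures=$((failures + 1))
# Passes when the nouveau module is absent from lsmod output.
check "Nouveau not loaded" sh -c '! lsmod | grep -q nouveau' || failures=$((failures + 1))
echo "$failures check(s) failed"
```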

4.2 Troubleshooting

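  • rmmod: ERROR: Module nouveau is in use: Nouveau is still loaded. Confirm that the blacklist file from 1.5 is in place, rebuild the initramfs, reboot, and make sure the system is in text mode (runlevel 3).
  • Install Failed: Check /var/log/nvidia-installer.log. A common cause is that the installed kernel-devel (or linux-headers) version does not match the running kernel reported by uname -r.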

AI-HPC Organization