GPU Driver & CUDA Installation Guide
Abstract: This guide provides the Standard Operating Procedure (SOP) for deploying NVIDIA GPU environments on Linux servers. It covers hardware checks, dependency installation, disabling Nouveau, driver installation, and CUDA configuration, including specific steps for SXM Fabric Manager.
1. Preparation
1.1 Hardware & System Check
Ensure the GPU is physically connected and detected.
# Check PCI devices
lspci | grep -i nvidia
If there is no output, check the physical slots and power connectors.
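The grep above also matches non-GPU functions such as NVIDIA audio controllers, so counting GPUs takes one more filter. A minimal sketch, assuming GPUs are reported by lspci with the class "3D controller" or "VGA compatible controller"; the function name count_nvidia_gpus is illustrative:

```bash
# Count NVIDIA GPUs in lspci-style text read from stdin.
# Assumption: GPUs appear with the PCI class "3D controller" or
# "VGA compatible controller"; other NVIDIA functions are ignored.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -Eci '3D controller|VGA compatible controller' || true
}
```

Run it as `lspci | count_nvidia_gpus` and compare the number with the GPU count you expect for the server.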
1.2 Download Software
Download the latest versions from NVIDIA:
- Driver: NVIDIA Driver Downloads (Linux 64-bit)
- CUDA: CUDA Toolkit Archive (Runfile (local) recommended)
1.3 Cleanup
Uninstall old versions if they exist.
# Uninstall Driver
/usr/bin/nvidia-uninstall -s
# Uninstall CUDA
/usr/local/cuda-X.Y/bin/cuda-uninstaller
1.4 Install Dependencies
Install headers and toolchains required for compilation.
CentOS 7:
yum install -y gcc gcc-c++ tar make bzip2 pkgconfig \
libglvnd-devel elfutils-libelf-devel \
kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Ubuntu:
apt-get update
apt-get install -y gcc g++ tar make pkg-config build-essential \
libglvnd-dev linux-headers-$(uname -r)
1.5 Disable Nouveau
Nouveau is the open-source NVIDIA driver. It conflicts with the official proprietary driver and must be disabled before that driver can be installed.
- Check Status:
  lsmod | grep nouveau
- Blacklist: add the two lines below to the file for your distribution.
  CentOS/RHEL: /usr/lib/modprobe.d/blacklist-nouveau.conf
  Ubuntu: /etc/modprobe.d/blacklist.conf
  blacklist nouveau
  options nouveau modeset=0
- Rebuild Initramfs & Reboot:
  # CentOS
  dracut -f
  # Ubuntu
  update-initramfs -u
  # Reboot to apply the change
  reboot
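On Ubuntu, /etc/modprobe.d/blacklist.conf typically already holds other entries, so the two Nouveau lines should be appended rather than written over the file. A minimal sketch of an idempotent append; ensure_line and blacklist_nouveau are hypothetical helpers, not part of any distribution:

```bash
# Append a line to a file only if that exact line is not already present.
# $1: file path, $2: line to ensure
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Add both Nouveau settings to the given modprobe configuration file.
blacklist_nouveau() {
    ensure_line "$1" 'blacklist nouveau'
    ensure_line "$1" 'options nouveau modeset=0'
}
```

As root, `blacklist_nouveau /etc/modprobe.d/blacklist.conf` can be run repeatedly without duplicating entries. The initramfs still has to be rebuilt afterwards.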
2. GPU Driver Installation
Runlevel: The installation must be performed in text mode (runlevel 3, which is multi-user.target on systemd systems) with the X server stopped.
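A quick pre-flight check can confirm that no display server is still running before the installer starts. A sketch, assuming the process names Xorg, X, and Xwayland cover the display servers in use; the function name is illustrative:

```bash
# Succeeds if a display server name appears in a list of process names
# read from stdin, one per line (for example from: ps -e -o comm=).
# Assumption: the display server runs as Xorg, X, or Xwayland.
display_server_running() {
    grep -Eqx 'Xorg|X|Xwayland'
}
```

For example, `ps -e -o comm= | display_server_running && echo "display server still running"`. On systemd hosts, text mode is normally reached with `systemctl isolate multi-user.target`.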
2.1 Install Driver
The commands below assume the package is named NVIDIA-Linux-x86_64-xxx.run.
chmod +x NVIDIA-Linux-x86_64-*.run
# -a: Accept license, -s: Silent, -Z: Disable Nouveau, --no-opengl-files: Avoid UI conflicts
./NVIDIA-Linux-x86_64-*.run -a -s -Z --no-opengl-files
2.2 Persistence Mode
Persistence mode keeps the driver loaded even when no client is connected to the GPU, which avoids driver re-initialization latency at the start of each job.
# Method A: System Service (Recommended)
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
# Method B: Command Line (Temporary)
nvidia-smi -pm 1
2.3 Fabric Manager (SXM Required)
For A100/H100/H800 SXM (NVLink) systems, Fabric Manager is mandatory. Without it, NVLink will not function.
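Because the Fabric Manager package has to match the driver version exactly, it is worth comparing the two before installing. A minimal sketch; the helper names are illustrative, and how the Fabric Manager version is obtained (here, from the package file name) is an assumption:

```bash
# Print the first dotted version number found in text read from stdin,
# for example the version embedded in a package file name.
extract_version() {
    grep -Eo '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeeds only if both version strings are non-empty and identical.
versions_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

With the driver already installed, its version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1`, and the package version with `ls nvidia-fabricmanager-*.rpm | extract_version`. Pass both to versions_match and stop if it fails.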
# Install (RPM example, version must match driver exactly)
rpm -ivh nvidia-fabricmanager-*.rpm
# Start Service
systemctl enable --now nvidia-fabricmanager
3. CUDA Toolkit Installation
3.1 Install CUDA
chmod +x cuda_*.run
# --no-opengl-libs: Skip OpenGL libs
# Note: Uncheck "Driver" in the interactive menu (since it's installed in Step 2)
./cuda_*.run --no-opengl-libs
3.2 Environment Variables
Edit /etc/profile or ~/.bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Apply and verify:
source /etc/profile
nvcc -V
4. Verification
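If nvcc is not found after sourcing the profile, a likely cause is that the CUDA directories never made it into the search paths. A small sketch for checking that; path_contains is a hypothetical helper:

```bash
# Succeeds if a directory is an entry of a colon-separated path list.
# $1: path list (e.g. "$PATH"), $2: directory to look for
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `path_contains "$PATH" /usr/local/cuda/bin || echo "CUDA bin directory missing from PATH"`. The same call with "$LD_LIBRARY_PATH" and /usr/local/cuda/lib64 covers the library path.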
4.1 Commands
| Component | Command | Expected Result |
|---|---|---|
| Driver Status | nvidia-smi | Shows GPU list, VRAM, Power |
| CUDA Version | nvcc -V | Shows version number |
| Fabric Manager | systemctl status nvidia-fabricmanager | Active (Running) |
| Persistence | systemctl status nvidia-persistenced | Active (Running) |
| Nouveau | `lsmod \| grep nouveau` | No output |
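The checks in the table can be bundled into one script for repeatable validation. A sketch, assuming every listed service applies to the host (drop the Fabric Manager line on non-SXM systems); check, nouveau_absent, and run_all_checks are illustrative names:

```bash
# Run a command quietly and print PASS or FAIL with a label.
# $1: label, remaining arguments: command to run
check() {
    check_label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $check_label"
    else
        echo "FAIL  $check_label"
        return 1
    fi
}

# Succeeds when the nouveau module is not listed by lsmod.
nouveau_absent() {
    ! lsmod | grep -q '^nouveau '
}

# Run every check from the table; the return status is the failure count.
run_all_checks() {
    failures=0
    check "driver"         nvidia-smi || failures=$((failures + 1))
    check "cuda"           nvcc -V || failures=$((failures + 1))
    check "fabric manager" systemctl is-active --quiet nvidia-fabricmanager || failures=$((failures + 1))
    check "persistence"    systemctl is-active --quiet nvidia-persistenced || failures=$((failures + 1))
    check "nouveau absent" nouveau_absent || failures=$((failures + 1))
    return "$failures"
}
```

Calling run_all_checks prints one PASS or FAIL line per component. A non-zero return status means at least one check failed.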
4.2 Troubleshooting
- rmmod: ERROR: Module nouveau is in use: Nouveau is still loaded. Ensure you blacklisted it, switched to runlevel 3 (init 3), and rebooted.
- Install Failed: Check /var/log/nvidia-installer.log. This is often caused by a mismatch between the installed kernel-devel version and the running kernel (uname -r).
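The kernel mismatch can be checked before rerunning the installer. A minimal sketch, assuming the usual layout in which /lib/modules/<release>/build points at the installed headers; headers_present is a hypothetical helper:

```bash
# Succeeds if build headers exist for the given kernel release.
# $1: kernel release (e.g. the output of uname -r)
# $2: modules root, defaults to /lib/modules (overridable for testing)
headers_present() {
    [ -d "${2:-/lib/modules}/$1/build" ]
}
```

For example, `headers_present "$(uname -r)" || echo "no headers for the running kernel"`. If the check fails, install the kernel-devel (CentOS) or linux-headers (Ubuntu) package that matches uname -r, or reboot into the kernel the headers were installed for.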
