Ubuntu 22.04 + Slurm 24.05 Cluster Deployment
Abstract: This document records the complete process of building a dual-node GPU cluster on Ubuntu 22.04.4 LTS. It covers compiling Slurm 24.05 from source, configuring MariaDB for job accounting, setting up NVIDIA drivers and cgroup-based resource constraints, and running distributed PyTorch jobs.
1. Hardware & Network
1.1 Nodes
| Hostname | IP (Mgmt) | IP (IB/LAN) | Role | GPU |
|---|---|---|---|---|
| serverx42 | 192.168.8.149 | 10.0.0.242 | Master/Compute | 2x RTX 4090 |
| serverx43 | 192.168.8.150 | 10.0.0.243 | Compute | 2x RTX 4090 |
1.2 Netplan Config
Edit /etc/netplan/00-installer-config.yaml:
```yaml
network:
  version: 2
  ethernets:
    eno1:
      addresses: [10.0.0.242/24]   # 10.0.0.243/24 on serverx43
      routes:
        - to: default
          via: 10.0.0.1
      nameservers:
        addresses: [10.0.0.1, 8.8.8.8]
```
Apply: netplan apply
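A quick sanity check after applying (the peer address comes from the node table above):
```bash
# Confirm the static address is active on eno1
ip -4 addr show eno1
# Confirm the data network reaches the other node
ping -c 3 10.0.0.243
```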
2. Base Environment
2.1 Time Sync
```bash
apt install -y chrony
systemctl enable --now chrony
```
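Optionally confirm that chrony has locked onto a time source:
```bash
# Show sync status and the selected servers
chronyc tracking
chronyc sources -v
```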
2.2 MUNGE Authentication
Sync keys between nodes to ensure trust.
```bash
# Install
apt-get install -y munge libmunge-dev
# Copy Key from Master to Compute Node
scp /etc/munge/munge.key root@serverx43:/etc/munge/
# Fix Permissions (Critical)
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl restart munge
```
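A round-trip test confirms both nodes accept the same key:
```bash
# Local check
munge -n | unmunge
# Cross-node check: should report STATUS: Success
munge -n | ssh serverx43 unmunge
```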
2.3 Dependencies
Install build tools and libraries for Slurm:
```bash
apt-get install -y build-essential mariadb-server libmariadb-dev-compat \
  python3 libhwloc-dev libpam0g-dev
```
3. Slurm 24.05 Compilation
3.1 Build & Install
```bash
wget https://download.schedmd.com/slurm/slurm-24.05.1.tar.bz2
tar -xf slurm-24.05.1.tar.bz2
cd slurm-24.05.1/
./configure --enable-debug --prefix=/usr/local \
  --sysconfdir=/usr/local/etc --with-mysql_config=/usr/bin/mysql_config
make -j$(nproc) && make install
```
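A version check confirms the binaries installed under the /usr/local prefix:
```bash
# Both should report 24.05.1
/usr/local/sbin/slurmd -V
/usr/local/sbin/slurmctld -V
```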
3.2 Database Setup
```bash
systemctl start mariadb
mysql -u root -p
> CREATE DATABASE slurm_acct_db;
> CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'Password123';
> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
```
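The accounting daemon needs its own slurmdbd.conf as well. A minimal sketch using the credentials above (placed in /usr/local/etc to match --sysconfdir; adjust the password and paths to your site):
```bash
# /usr/local/etc/slurmdbd.conf (chmod 600, owned by the slurm user)
AuthType=auth/munge
DbdHost=serverx42
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=Password123
StorageLoc=slurm_acct_db
```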
3.3 Configuration Files
1. slurm.conf
```bash
ClusterName=HPCcluster
SlurmctldHost=serverx42
MpiDefault=none
ProctrackType=proctrack/cgroup
# task/cgroup enforces the Constrain* settings in cgroup.conf
TaskPlugin=task/cgroup
SlurmUser=slurm
# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Database
AccountingStorageType=accounting_storage/slurmdbd
# Nodes (GPU Enabled)
GresTypes=gpu
NodeName=serverx[42-43] CPUs=32 RealMemory=64000 Gres=gpu:geforce:2 State=UNKNOWN
PartitionName=SERVER Nodes=serverx[42-43] Default=YES MaxTime=INFINITE State=UP
```
2. gres.conf
```bash
NodeName=serverx[42-43] Name=gpu Type=geforce File=/dev/nvidia0 Cores=0-15
NodeName=serverx[42-43] Name=gpu Type=geforce File=/dev/nvidia1 Cores=16-31
```
3. cgroup.conf
```bash
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
```
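Ubuntu 22.04 mounts the unified cgroup v2 hierarchy by default, which Slurm 24.05 supports; to confirm which hierarchy is active:
```bash
# Prints "cgroup2fs" when cgroup v2 is in use
stat -fc %T /sys/fs/cgroup
```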
3.4 Start Services
```bash
# Unit files are shipped in the etc/ directory of the Slurm source tree
cp etc/slurm*service /lib/systemd/system/
# On serverx42; the compute node serverx43 only needs slurmd
systemctl enable --now slurmdbd slurmctld slurmd
```
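Once the daemons are up, verify the cluster state from the master:
```bash
# Both nodes should eventually report "idle" in the SERVER partition
sinfo
# Per-node details, including the gpu:geforce:2 Gres
scontrol show nodes
```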
4. NVIDIA Drivers & CUDA
4.1 Disable Nouveau
```bash
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
update-initramfs -u
reboot
```
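After the reboot, nouveau should no longer be loaded:
```bash
# No output means the blacklist took effect
lsmod | grep nouveau
```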
4.2 Install Driver
```bash
chmod +x NVIDIA-Linux-x86_64-550.100.run
./NVIDIA-Linux-x86_64-550.100.run --disable-nouveau --no-opengl-files
```
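Verify the driver on each node:
```bash
# Should list both RTX 4090 cards
nvidia-smi
```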
5. User & Job Testing
5.1 User Sync
Create the user with the same UID on all nodes:
```bash
groupadd -g 2004 test004
useradd -u 2004 -g 2004 -m -s /bin/bash test004
```
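Run the same two commands on serverx43, then confirm the IDs match on both nodes:
```bash
# Expect uid=2004 gid=2004 on every node
id test004
```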
5.2 Slurm Account
```bash
sacctmgr add cluster HPCcluster
sacctmgr add account normal Description="Default"
sacctmgr create user test004 account=normal
```
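Verify the association before submitting jobs:
```bash
# Show the cluster/account/user mapping just created
sacctmgr show assoc format=cluster,account,user
```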
5.3 Submit PyTorch Job
submit.sh:
```bash
#!/bin/bash
#SBATCH --job-name=torch_test
#SBATCH --partition=SERVER
#SBATCH --nodes=2
#SBATCH --gpus=geforce:4
srun --nodelist=serverx42,serverx43 python -c "import torch; print(torch.cuda.device_count())"
```
Run: sbatch submit.sh
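With ConstrainDevices enabled and 2 GPUs per node, each task is expected to print 2. Track the job and read its output (slurm-<jobid>.out is Slurm's default output file name):
```bash
# Queue state while the job runs
squeue
# Accounting record via slurmdbd
sacct -j <jobid>
# Job output in the submission directory
cat slurm-<jobid>.out
```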
6. Monitoring
Deploy Node Exporter, Prometheus, and Grafana to visualize CPU, memory, and GPU metrics; the Grafana dashboards are served on port 3000.
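As a sketch, assuming node_exporter listens on its default port 9100 on both nodes, a Prometheus scrape job could look like this (Grafana then uses Prometheus as its data source and serves dashboards on port 3000):
```yaml
# prometheus.yml excerpt: scrape node_exporter on both cluster nodes
scrape_configs:
  - job_name: 'hpc-nodes'
    static_configs:
      - targets: ['192.168.8.149:9100', '192.168.8.150:9100']
```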
