
Ubuntu 22.04 + Slurm 24.05 Cluster Deployment

Abstract: This document records the complete process of building a two-node GPU cluster on Ubuntu 22.04.4 LTS. It covers compiling Slurm 24.05 from source, configuring MariaDB accounting, setting up NVIDIA drivers and cgroup-based device isolation, and running distributed PyTorch jobs.

1. Hardware & Network

1.1 Nodes

Hostname    IP (Mgmt)      IP (IB/LAN)   Role            GPU
serverx42   192.168.8.149  10.0.0.242    Master/Compute  2x RTX 4090
serverx43   192.168.8.150  10.0.0.243    Compute         2x RTX 4090

1.2 Netplan Config

Edit /etc/netplan/00-installer-config.yaml (shown for serverx42; use 10.0.0.243 on serverx43):

yaml
network:
  ethernets:
    eno1:
      addresses: [10.0.0.242/24]
      routes:
      - to: default
        via: 10.0.0.1
      nameservers:
        addresses: [10.0.0.1, 8.8.8.8]
  version: 2

Apply: netplan apply
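
Slurm and MUNGE address nodes by hostname, so both machines must be able to resolve each other. A minimal /etc/hosts sketch, assuming cluster traffic runs over the 10.0.0.0/24 network:

bash
# /etc/hosts (both nodes)
10.0.0.242  serverx42
10.0.0.243  serverx43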

2. Base Environment

2.1 Time Sync

bash
apt install -y chrony
systemctl enable --now chrony
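
MUNGE rejects credentials when clocks drift too far apart, so verify synchronization on both nodes:

bash
chronyc tracking    # "Leap status : Normal" means the clock is synchronized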

2.2 MUNGE Authentication

All nodes must share the same munge.key so they trust each other's credentials.

bash
# Install
apt-get install -y munge libmunge-dev

# Copy the key from the master to the compute node
scp /etc/munge/munge.key root@serverx43:/etc/munge/

# Fix permissions (critical; run on both nodes)
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl restart munge
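
Verify cross-node authentication by encoding a credential locally and decoding it on the other node:

bash
munge -n | unmunge                   # local round trip
munge -n | ssh serverx43 unmunge     # should end with STATUS: Success (0)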

2.3 Dependencies

Install build tools and libraries for Slurm:

bash
apt-get install -y build-essential mariadb-server libmariadb-dev-compat \
    python3 libhwloc-dev libpam0g-dev

3. Slurm 24.05 Compilation

3.1 Build & Install

bash
wget https://download.schedmd.com/slurm/slurm-24.05.1.tar.bz2
tar -xf slurm-24.05.1.tar.bz2
cd slurm-24.05.1/

./configure --enable-debug --prefix=/usr/local \
    --sysconfdir=/usr/local/etc --with-mysql_config=/usr/bin/mysql_config

make -j$(nproc) && make install
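
Confirm the binaries landed under /usr/local:

bash
sinfo --version            # should report slurm 24.05.1
which slurmctld slurmd     # both should resolve under /usr/local/sbin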

3.2 Database Setup

bash
systemctl start mariadb
mysql -u root -p
> CREATE DATABASE slurm_acct_db;
> CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'Password123';
> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
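
slurmdbd reads the database credentials from slurmdbd.conf in the same --sysconfdir. A minimal sketch matching the database created above:

bash
# /usr/local/etc/slurmdbd.conf -- must be owned by slurm and chmod 600
AuthType=auth/munge
DbdHost=serverx42
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=Password123
StorageLoc=slurm_acct_db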

3.3 Configuration Files

Place the following files under /usr/local/etc (the --sysconfdir chosen at configure time) on both nodes.

1. slurm.conf

bash
ClusterName=HPCcluster
SlurmctldHost=serverx42
MpiDefault=none
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd

# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Database
AccountingStorageType=accounting_storage/slurmdbd

# Nodes (GPU Enabled)
GresTypes=gpu
NodeName=serverx[42-43] CPUs=32 RealMemory=64000 Gres=gpu:geforce:2 State=UNKNOWN
PartitionName=SERVER Nodes=serverx[42-43] Default=YES MaxTime=INFINITE State=UP

2. gres.conf

bash
NodeName=serverx[42-43] Name=gpu Type=geforce File=/dev/nvidia0 Cores=0-15
NodeName=serverx[42-43] Name=gpu Type=geforce File=/dev/nvidia1 Cores=16-31

3. cgroup.conf

bash
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
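
The Cores= ranges in gres.conf should match each GPU's CPU affinity; both can be checked directly on the nodes:

bash
nvidia-smi topo -m    # shows each GPU's CPU affinity
slurmd -C             # prints the CPUs/sockets/memory Slurm detects on this node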

3.4 Start Services
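
Before starting the daemons, create the slurm system account referenced by SlurmUser on both nodes, along with the state and spool directories (paths match StateSaveLocation and SlurmdSpoolDir above); a minimal sketch:

bash
# Both nodes
groupadd -r slurm
useradd -r -g slurm -s /usr/sbin/nologin slurm
mkdir -p /var/spool/slurmctld /var/spool/slurmd
chown slurm:slurm /var/spool/slurmctld
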

bash
cp etc/slurm*service /lib/systemd/system/ && systemctl daemon-reload

# Master: all three daemons; compute node (serverx43): slurmd only
systemctl enable --now slurmdbd slurmctld slurmd
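
Once the daemons are up, verify cluster state from the master:

bash
sinfo                                          # SERVER partition should list serverx[42-43] as idle
scontrol show node serverx43 | grep -i gres    # should show gpu:geforce:2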

4. NVIDIA Drivers & CUDA

4.1 Disable Nouveau

bash
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
update-initramfs -u
reboot

4.2 Install Driver

bash
chmod +x NVIDIA-Linux-x86_64-550.100.run
./NVIDIA-Linux-x86_64-550.100.run --disable-nouveau --no-opengl-files
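
On each node, confirm the driver sees both GPUs:

bash
nvidia-smi -L    # expected: two "NVIDIA GeForce RTX 4090" entries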

5. User & Job Testing

5.1 User Sync

Create the user with the same UID and GID on all nodes so job ownership maps consistently:

bash
groupadd -g 2004 test004
useradd -u 2004 -g 2004 -m -s /bin/bash test004

5.2 Slurm Account

bash
sacctmgr add cluster HPCcluster
sacctmgr add account normal Description="Default"
sacctmgr create user test004 account=normal
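
Confirm the association exists:

bash
sacctmgr show assoc user=test004 format=Cluster,Account,User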

5.3 Submit PyTorch Job

submit.sh:

bash
#!/bin/bash
#SBATCH --job-name=torch_test
#SBATCH --partition=SERVER
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus=geforce:4

# One task per node; each prints the number of GPUs visible on its node
srun python -c "import torch; print(torch.cuda.device_count())"

Run: sbatch submit.sh
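
Check progress and output:

bash
squeue -u test004
sacct -j <jobid> --format=JobID,JobName,State,Elapsed
cat slurm-<jobid>.out    # each of the two tasks should print 2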

6. Monitoring

Deploy Node Exporter and Prometheus for CPU and memory metrics, an NVIDIA exporter (e.g. DCGM exporter) for GPU metrics, and Grafana (port 3000) for dashboards.
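
A minimal scrape-configuration sketch, assuming node_exporter listens on its default port 9100 on both nodes and that Prometheus uses the Ubuntu package's /etc/prometheus/prometheus.yml path:

bash
cat > /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['10.0.0.242:9100', '10.0.0.243:9100']
EOF
systemctl restart prometheus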

AI-HPC Organization