Ubuntu 22.04 + Slurm 24.05 Cluster Deployment
Abstract: This document records the complete process of building a dual-node GPU cluster on Ubuntu 22.04.4 LTS. It covers compiling Slurm 24.05 from source, configuring MariaDB for job accounting, setting up NVIDIA drivers and cgroup-based resource constraints, and running distributed PyTorch jobs.
1. Hardware & Network
1.1 Nodes
| Hostname | IP (Mgmt) | IP (IB/LAN) | Role | GPU |
|---|---|---|---|---|
| serverx42 | 192.168.8.149 | 10.0.0.242 | Master/Compute | 2x RTX 4090 |
| serverx43 | 192.168.8.150 | 10.0.0.243 | Compute | 2x RTX 4090 |
1.2 Netplan Config
Edit /etc/netplan/00-installer-config.yaml:
```yaml
network:
  version: 2
  ethernets:
    eno1:
      addresses: [10.0.0.242/24]   # 10.0.0.243/24 on serverx43
      routes:
        - to: default
          via: 10.0.0.1
      nameservers:
        addresses: [10.0.0.1, 8.8.8.8]
```
Apply: netplan apply
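A quick sanity check after applying (the peer address comes from the node table above):
```bash
# Confirm the static address is active on eno1
ip -4 addr show eno1
# Confirm the data network reaches the other node
ping -c 3 10.0.0.243
```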
2. Base Environment
2.1 Time Sync
```bash
apt install -y chrony
systemctl enable --now chrony
```
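Optionally confirm that chrony has locked onto a time source:
```bash
# Show sync status and the selected servers
chronyc tracking
chronyc sources -v
```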
2.2 MUNGE Authentication
Sync keys between nodes to ensure trust.
```bash
# Install
apt-get install -y munge libmunge-dev
# Copy Key from Master to Compute Node
scp /etc/munge/munge.key root@serverx43:/etc/munge/
# Fix Permissions (Critical)
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl restart munge
```
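A round-trip test confirms both nodes accept the same key:
```bash
# Local check
munge -n | unmunge
# Cross-node check: should report STATUS: Success
munge -n | ssh serverx43 unmunge
```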
2.3 Dependencies
Install build tools and libraries for Slurm:
```bash
apt-get install -y build-essential mariadb-server libmariadb-dev-compat \
  python3 libhwloc-dev libpam0g-dev
```
3. Slurm 24.05 Compilation
3.1 Build & Install
```bash
wget https://download.schedmd.com/slurm/slurm-24.05.1.tar.bz2
tar -xf slurm-24.05.1.tar.bz2
cd slurm-24.05.1/
./configure --enable-debug --prefix=/usr/local \
  --sysconfdir=/usr/local/etc --with-mysql_config=/usr/bin/mysql_config
make -j$(nproc) && make install
```
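A version check confirms the binaries installed under the /usr/local prefix:
```bash
# Both should report 24.05.1
/usr/local/sbin/slurmd -V
/usr/local/sbin/slurmctld -V
```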
3.2 Database Setup
```bash
systemctl start mariadb
mysql -u root -p
> CREATE DATABASE slurm_acct_db;
> CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'Password123';
> GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
```
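The accounting daemon needs its own slurmdbd.conf as well. A minimal sketch using the credentials above (placed in /usr/local/etc to match --sysconfdir; adjust the password and paths to your site):
```bash
# /usr/local/etc/slurmdbd.conf (chmod 600, owned by the slurm user)
AuthType=auth/munge
DbdHost=serverx42
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=Password123
StorageLoc=slurm_acct_db
```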
3.3 Configuration Files
1. slurm.conf
```bash
ClusterName=HPCcluster
SlurmctldHost=serverx42
MpiDefault=none
ProctrackType=proctrack/cgroup
# task/cgroup enforces the Constrain* settings in cgroup.conf
TaskPlugin=task/cgroup
SlurmUser=slurm
# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Database
AccountingStorageType=accounting_storage/slurmdbd
# Nodes (GPU Enabled)
GresTypes=gpu
NodeName=serverx[42-43] CPUs=32 RealMemory=64000 Gres=gpu:geforce:2 State=UNKNOWN
PartitionName=SERVER Nodes=serverx[42-43] Default=YES MaxTime=INFINITE State=UP
```
2. gres.conf
```bash
NodeName=serverx[42-43] Name=gpu Type=geforce File=/dev/nvidia0 Cores=0-15
NodeName=serverx[42-43] Name=gpu Type=geforce File=/dev/nvidia1 Cores=16-31
```
3. cgroup.conf
```bash
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
```
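Ubuntu 22.04 mounts the unified cgroup v2 hierarchy by default, which Slurm 24.05 supports; to confirm which hierarchy is active:
```bash
# Prints "cgroup2fs" when cgroup v2 is in use
stat -fc %T /sys/fs/cgroup
```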
3.4 Start Services
```bash
# Unit files are shipped in the etc/ directory of the Slurm source tree
cp etc/slurm*service /lib/systemd/system/
# On serverx42; the compute node serverx43 only needs slurmd
systemctl enable --now slurmdbd slurmctld slurmd
```
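Once the daemons are up, verify the cluster state from the master:
```bash
# Both nodes should eventually report "idle" in the SERVER partition
sinfo
# Per-node details, including the gpu:geforce:2 Gres
scontrol show nodes
```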
4. NVIDIA Drivers & CUDA
4.1 Disable Nouveau
```bash
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
update-initramfs -u
reboot
```
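After the reboot, nouveau should no longer be loaded:
```bash
# No output means the blacklist took effect
lsmod | grep nouveau
```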
4.2 Install Driver
```bash
chmod +x NVIDIA-Linux-x86_64-550.100.run
./NVIDIA-Linux-x86_64-550.100.run --disable-nouveau --no-opengl-files
```
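Verify the driver on each node:
```bash
# Should list both RTX 4090 cards
nvidia-smi
```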
5. User & Job Testing
5.1 User Sync
Create the user with the same UID on all nodes:
```bash
groupadd -g 2004 test004
useradd -u 2004 -g 2004 -m -s /bin/bash test004
```
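Run the same two commands on serverx43, then confirm the IDs match on both nodes:
```bash
# Expect uid=2004 gid=2004 on every node
id test004
```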
5.2 Slurm Account
```bash
sacctmgr add cluster HPCcluster
sacctmgr add account normal Description="Default"
sacctmgr create user test004 account=normal
```
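Verify the association before submitting jobs:
```bash
# Show the cluster/account/user mapping just created
sacctmgr show assoc format=cluster,account,user
```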
5.3 Submit PyTorch Job
submit.sh:
```bash
#!/bin/bash
#SBATCH --job-name=torch_test
#SBATCH --partition=SERVER
#SBATCH --nodes=2
#SBATCH --gpus=geforce:4
srun --nodelist=serverx42,serverx43 python -c "import torch; print(torch.cuda.device_count())"
```
Run: sbatch submit.sh
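With ConstrainDevices enabled and 2 GPUs per node, each task is expected to print 2. Track the job and read its output (slurm-<jobid>.out is Slurm's default output file name):
```bash
# Queue state while the job runs
squeue
# Accounting record via slurmdbd
sacct -j <jobid>
# Job output in the submission directory
cat slurm-<jobid>.out
```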
6. Monitoring
Deploy Node Exporter, Prometheus, and Grafana to visualize CPU, memory, and GPU metrics; the Grafana dashboards are served on port 3000.
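As a sketch, assuming node_exporter listens on its default port 9100 on both nodes, a Prometheus scrape job could look like this (Grafana then uses Prometheus as its data source and serves dashboards on port 3000):
```yaml
# prometheus.yml excerpt: scrape node_exporter on both cluster nodes
scrape_configs:
  - job_name: 'hpc-nodes'
    static_configs:
      - targets: ['192.168.8.149:9100', '192.168.8.150:9100']
```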
