
Rocky Linux 9 HPC Cluster Deployment

Abstract: This document records the complete process of building a High Performance Computing (HPC) cluster based on Rocky Linux 9.4. It covers system initialization, NFS storage, MUNGE authentication, Slurm workload manager, and Prometheus monitoring.

1. Architecture Planning

1.1 Node Roles

Role      Hostname    IP              Function
Manager   mu01        192.168.8.100   Slurmctld, DB, Monitor, NFS Server
Compute   cu01-cu19   192.168.8.101+  Task Execution (Slurmd)

1.2 Storage

  • System: OS Root.
  • Data: /data (NFS Shared).
  • Apps: /opt (NFS Shared for compilers/apps).
  • Home: /home (NFS Shared for user data).

2. System Initialization

2.1 Network & Security

Execute on ALL nodes:

bash
# 1. Set Hostname (use each node's own name: mu01 on the manager, cu01-cu19 on the compute nodes)
hostnamectl set-hostname mu01

# 2. Disable Firewall & SELinux
systemctl stop firewalld && systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

# 3. Hosts File
cat >> /etc/hosts <<EOF
192.168.8.100   mu01
192.168.8.101   cu01
# ...
EOF
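
Since the compute nodes follow the addressing scheme from section 1.1 (cu01 at 192.168.8.101 and so on), the remaining entries can be generated rather than typed by hand. A minimal sketch, assuming strictly sequential addresses:

bash
# Append cu01-cu19 to /etc/hosts (assumes cuNN = 192.168.8.(100+NN))
for i in $(seq 1 19); do
    printf "192.168.8.%d\tcu%02d\n" $((100 + i)) "$i"
done >> /etc/hosts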

2.2 SSH Trust

Generate keys on the manager node and distribute:

bash
ssh-keygen -t rsa -N ""
for i in {01..19}; do ssh-copy-id cu$i; done

2.3 Chrony NTP

  • Manager: allow 192.168.8.0/24.
  • Compute: server 192.168.8.100 iburst.
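
A minimal sketch of the corresponding chrony setup, assuming the stock /etc/chrony.conf on Rocky Linux 9:

bash
# Manager (mu01): serve time to the cluster subnet
echo "allow 192.168.8.0/24" >> /etc/chrony.conf

# Compute nodes: point at the manager instead of the public pool
echo "server 192.168.8.100 iburst" >> /etc/chrony.conf

# All nodes: restart and verify
systemctl enable chronyd
systemctl restart chronyd
chronyc sources    # verify the selected time source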

3. Storage & User Management

3.1 NFS

Manager (Server):

bash
yum install -y nfs-utils
cat >> /etc/exports <<EOF
/home   192.168.8.0/24(rw,sync,no_root_squash)
/opt    192.168.8.0/24(rw,sync,no_root_squash)
/data   192.168.8.0/24(rw,sync,no_root_squash)
EOF
systemctl enable --now nfs-server

Compute (Client):

bash
yum install -y nfs-utils
mkdir -p /data
mount 192.168.8.100:/home /home
mount 192.168.8.100:/opt /opt
mount 192.168.8.100:/data /data
# Add to /etc/fstab for persistence (see below)
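
The persistent equivalents in /etc/fstab might look like the following sketch (the _netdev option defers mounting until the network is up):

bash
cat >> /etc/fstab <<EOF
192.168.8.100:/home  /home  nfs  defaults,_netdev  0 0
192.168.8.100:/opt   /opt   nfs  defaults,_netdev  0 0
192.168.8.100:/data  /data  nfs  defaults,_netdev  0 0
EOF
mount -a   # confirm the entries parse and mount cleanly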

3.2 User Sync

For small clusters, sync files directly:

bash
scp /etc/passwd /etc/group /etc/shadow cuXX:/etc/
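
For the 19 compute nodes from section 1.1 this can be wrapped in a loop. A sketch; note that it overwrites the node-local copies wholesale, which is only appropriate when all nodes were installed identically:

bash
for i in {01..19}; do
    scp /etc/passwd /etc/group /etc/shadow cu$i:/etc/
done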

4. MUNGE Authentication

4.1 Install

bash
yum install -y munge munge-libs munge-devel

4.2 Key Distribution

Generate key on Manager:

bash
create-munge-key
# Sync to all nodes
scp /etc/munge/munge.key root@cuXX:/etc/munge/

4.3 Permissions

On ALL nodes:

bash
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl enable --now munge
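
With munge running everywhere, authentication can be verified from the manager (munge -n creates a test credential, unmunge decodes it):

bash
munge -n | unmunge            # local check
munge -n | ssh cu01 unmunge   # cross-node check; should decode without error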

5. Slurm Deployment

Version: Slurm 24.05

5.1 Database (MariaDB)

On Manager:

bash
yum install -y mariadb-server
systemctl enable --now mariadb
# Create DB 'slurm_acct_db' and user 'slurm'
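
A minimal sketch of that step; the password is a placeholder and must match StoragePass in slurmdbd.conf:

bash
mysql -u root <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF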

5.2 Build & Install

bash
# Build RPMs
rpmbuild -ta slurm-24.05.2.tar.bz2

# Install
# Manager: slurm, slurm-slurmctld, slurm-slurmdbd
# Compute: slurm, slurm-slurmd
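
With the default rpmbuild layout the packages land in ~/rpmbuild/RPMS/x86_64/. A sketch of installing them (exact file names vary with the build release; compute nodes need the RPMs copied over first, e.g. via the shared /opt):

bash
cd ~/rpmbuild/RPMS/x86_64/

# Manager
dnf install -y ./slurm-24.05*.rpm ./slurm-slurmctld-*.rpm ./slurm-slurmdbd-*.rpm

# Compute
dnf install -y ./slurm-24.05*.rpm ./slurm-slurmd-*.rpm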

5.3 Configuration

Key files: slurm.conf, slurmdbd.conf, cgroup.conf.

Sample slurm.conf:

ini
ClusterName=hpccluster
SlurmctldHost=mu01
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmUser=slurm

# Nodes
NodeName=cu[01-19] CPUs=24 RealMemory=94000 State=UNKNOWN
PartitionName=debug Nodes=cu[01-19] Default=YES MaxTime=INFINITE State=UP
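
Because jobs are accounted through slurmdbd (section 5.1), the sample would also carry the lines that tie slurm.conf to the accounting database; a sketch, with the host value assumed to match this layout:

ini
# Accounting via slurmdbd on the manager
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mu01
JobAcctGatherType=jobacct_gather/cgroup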

5.4 Start Services

Manager:

bash
systemctl enable --now slurmdbd
systemctl enable --now slurmctld

Compute:

bash
systemctl enable --now slurmd

Verify: sinfo should show nodes as idle.
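
A quick functional check from the manager; srun goes through slurmctld and slurmd, so even a trivial job exercises the whole stack:

bash
sinfo                 # nodes should be listed in state "idle"
srun -N 2 hostname    # run a trivial job across two compute nodes
sacct                 # accounting records should appear via slurmdbd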

6. Monitoring (Prometheus + Grafana)

  1. Node Exporter: Run on all nodes to export metrics.
  2. Prometheus: Runs on the manager (mu01) and scrapes metrics from all nodes; a sample scrape configuration follows this list.
  3. Grafana: Visualize cluster status.
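
A minimal sketch of the Prometheus scrape configuration, assuming node_exporter listens on its default port 9100 and that Prometheus on mu01 reads /etc/prometheus/prometheus.yml (the path depends on how Prometheus was installed):

bash
cat > /etc/prometheus/prometheus.yml <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'hpc-nodes'
    static_configs:
      - targets: ['mu01:9100', 'cu01:9100', 'cu02:9100']   # list through cu19
EOF
# Restart Prometheus afterwards (e.g. systemctl restart prometheus, if it runs as a service)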

7. Summary

A complete HPC cluster is now ready:

  • Compute: Managed by Slurm.
  • Storage: Shared via NFS.
  • Monitoring: Cluster-wide visibility via Prometheus and Grafana.

AI-HPC Organization