
Rocky Linux 9 HPC Cluster Deployment

Abstract: This document records the complete process of building a High Performance Computing (HPC) cluster based on Rocky Linux 9.4. It covers system initialization, NFS storage, MUNGE authentication, Slurm workload manager, and Prometheus monitoring.

1. Architecture Planning

1.1 Node Roles

Role      Hostname    IP              Function
Manager   mu01        192.168.8.100   Slurmctld, DB, Monitor, NFS Server
Compute   cu01-cu19   192.168.8.101+  Task Execution (Slurmd)

1.2 Storage

  • System: OS Root.
  • Data: /data (NFS Shared).
  • Apps: /opt (NFS Shared for compilers/apps).
  • Home: /home (NFS Shared for user data).

2. System Initialization

2.1 Network & Security

Execute on ALL nodes:

bash
# 1. Set Hostname (use each node's own name: mu01 on the manager, cu01-cu19 on the compute nodes)
hostnamectl set-hostname mu01

# 2. Disable Firewall & SELinux
systemctl stop firewalld && systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

# 3. Hosts File
cat >> /etc/hosts <<EOF
192.168.8.100   mu01
192.168.8.101   cu01
# ...
EOF
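
Since the compute nodes follow the addressing scheme from section 1.1 (cu01 at 192.168.8.101 and so on), the remaining entries can be generated rather than typed by hand. A minimal sketch, assuming strictly sequential addresses:

bash
# Append cu01-cu19 to /etc/hosts (assumes cuNN = 192.168.8.(100+NN))
for i in $(seq 1 19); do
    printf "192.168.8.%d\tcu%02d\n" $((100 + i)) "$i"
done >> /etc/hosts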

2.2 SSH Trust

Generate keys on the manager node and distribute:

bash
ssh-keygen -t rsa -N ""
for i in {01..19}; do ssh-copy-id cu$i; done

2.3 Chrony NTP

  • Manager: allow 192.168.8.0/24.
  • Compute: server 192.168.8.100 iburst.
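
A minimal sketch of the corresponding chrony setup, assuming the stock /etc/chrony.conf on Rocky Linux 9:

bash
# Manager (mu01): serve time to the cluster subnet
echo "allow 192.168.8.0/24" >> /etc/chrony.conf

# Compute nodes: point at the manager instead of the public pool
echo "server 192.168.8.100 iburst" >> /etc/chrony.conf

# All nodes: restart and verify
systemctl enable chronyd
systemctl restart chronyd
chronyc sources    # verify the selected time source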

3. Storage & User Management

3.1 NFS

Manager (Server):

bash
yum install -y nfs-utils
cat >> /etc/exports <<EOF
/home   192.168.8.0/24(rw,sync,no_root_squash)
/opt    192.168.8.0/24(rw,sync,no_root_squash)
/data   192.168.8.0/24(rw,sync,no_root_squash)
EOF
systemctl enable --now nfs-server

Compute (Client):

bash
yum install -y nfs-utils
mkdir -p /data
mount 192.168.8.100:/home /home
mount 192.168.8.100:/opt /opt
mount 192.168.8.100:/data /data
# Add to /etc/fstab for persistence (see below)
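
The persistent equivalents in /etc/fstab might look like the following sketch (the _netdev option defers mounting until the network is up):

bash
cat >> /etc/fstab <<EOF
192.168.8.100:/home  /home  nfs  defaults,_netdev  0 0
192.168.8.100:/opt   /opt   nfs  defaults,_netdev  0 0
192.168.8.100:/data  /data  nfs  defaults,_netdev  0 0
EOF
mount -a   # confirm the entries parse and mount cleanly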

3.2 User Sync

For small clusters, sync files directly:

bash
scp /etc/passwd /etc/group /etc/shadow cuXX:/etc/
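
For the 19 compute nodes from section 1.1 this can be wrapped in a loop. A sketch; note that it overwrites the node-local copies wholesale, which is only appropriate when all nodes were installed identically:

bash
for i in {01..19}; do
    scp /etc/passwd /etc/group /etc/shadow cu$i:/etc/
done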

4. MUNGE Authentication

4.1 Install

bash
yum install -y munge munge-libs munge-devel

4.2 Key Distribution

Generate key on Manager:

bash
create-munge-key
# Sync to all nodes
scp /etc/munge/munge.key root@cuXX:/etc/munge/

4.3 Permissions

On ALL nodes:

bash
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl enable --now munge
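
With munge running everywhere, authentication can be verified from the manager (munge -n creates a test credential, unmunge decodes it):

bash
munge -n | unmunge            # local check
munge -n | ssh cu01 unmunge   # cross-node check; should decode without error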

5. Slurm Deployment

Version: Slurm 24.05

5.1 Database (MariaDB)

On Manager:

bash
yum install -y mariadb-server
systemctl enable --now mariadb
# Create DB 'slurm_acct_db' and user 'slurm'
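
A minimal sketch of that step; the password is a placeholder and must match StoragePass in slurmdbd.conf:

bash
mysql -u root <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF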

5.2 Build & Install

bash
# Build RPMs
rpmbuild -ta slurm-24.05.2.tar.bz2

# Install
# Manager: slurm, slurm-slurmctld, slurm-slurmdbd
# Compute: slurm, slurm-slurmd
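
With the default rpmbuild layout the packages land in ~/rpmbuild/RPMS/x86_64/. A sketch of installing them (exact file names vary with the build release; compute nodes need the RPMs copied over first, e.g. via the shared /opt):

bash
cd ~/rpmbuild/RPMS/x86_64/

# Manager
dnf install -y ./slurm-24.05*.rpm ./slurm-slurmctld-*.rpm ./slurm-slurmdbd-*.rpm

# Compute
dnf install -y ./slurm-24.05*.rpm ./slurm-slurmd-*.rpm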

5.3 Configuration

Key files: slurm.conf, slurmdbd.conf, cgroup.conf.

Sample slurm.conf:

ini
ClusterName=hpccluster
SlurmctldHost=mu01
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmUser=slurm

# Nodes
NodeName=cu[01-19] CPUs=24 RealMemory=94000 State=UNKNOWN
PartitionName=debug Nodes=cu[01-19] Default=YES MaxTime=INFINITE State=UP
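
Because jobs are accounted through slurmdbd (section 5.1), the sample would also carry the lines that tie slurm.conf to the accounting database; a sketch, with the host value assumed to match this layout:

ini
# Accounting via slurmdbd on the manager
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=mu01
JobAcctGatherType=jobacct_gather/cgroup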

5.4 Start Services

Manager:

bash
systemctl enable --now slurmdbd
systemctl enable --now slurmctld

Compute:

bash
systemctl enable --now slurmd

Verify: sinfo should show nodes as idle.
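
A quick functional check from the manager; srun goes through slurmctld and slurmd, so even a trivial job exercises the whole stack:

bash
sinfo                 # nodes should be listed in state "idle"
srun -N 2 hostname    # run a trivial job across two compute nodes
sacct                 # accounting records should appear via slurmdbd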

6. Monitoring (Prometheus + Grafana)

  1. Node Exporter: Run on all nodes to export metrics.
  2. Prometheus: Runs on the manager (mu01) and scrapes metrics from all nodes; a sample scrape configuration follows this list.
  3. Grafana: Visualize cluster status.
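
A minimal sketch of the Prometheus scrape configuration, assuming node_exporter listens on its default port 9100 and that Prometheus on mu01 reads /etc/prometheus/prometheus.yml (the path depends on how Prometheus was installed):

bash
cat > /etc/prometheus/prometheus.yml <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'hpc-nodes'
    static_configs:
      - targets: ['mu01:9100', 'cu01:9100', 'cu02:9100']   # list through cu19
EOF
# Restart Prometheus afterwards (e.g. systemctl restart prometheus, if it runs as a service)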

7. Summary

A complete HPC cluster is now ready:

  • Compute: Managed by Slurm.
  • Storage: Shared via NFS.
  • Monitoring: Cluster-wide visibility via Prometheus and Grafana.

AI-HPC Organization