Rocky Linux 9 HPC Cluster Deployment
Abstract: This document records the complete process of building a High Performance Computing (HPC) cluster based on Rocky Linux 9.4. It covers system initialization, NFS storage, MUNGE authentication, Slurm workload manager, and Prometheus monitoring.
1. Architecture Planning
1.1 Node Roles
| Role | Hostname | IP | Function |
|---|---|---|---|
| Manager | mu01 | 192.168.8.100 | slurmctld, slurmdbd/MariaDB, monitoring, NFS server |
| Compute | cu01-cu19 | 192.168.8.101-119 | Job execution (slurmd) |
1.2 Storage
- System: OS root.
- Data: /data (NFS shared).
- Apps: /opt (NFS shared for compilers/apps).
- Home: /home (NFS shared for user data).
2. System Initialization
2.1 Network & Security
Execute on ALL nodes:
```bash
# 1. Set Hostname (use each node's own name: mu01 on the manager, cu01-cu19 on compute nodes)
hostnamectl set-hostname mu01
# 2. Disable Firewall & SELinux
systemctl stop firewalld && systemctl disable firewalld
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
# 3. Hosts File
cat >> /etc/hosts <<EOF
192.168.8.100 mu01
192.168.8.101 cu01
# ...
EOF
```
2.2 SSH Trust
Generate keys on the manager node and distribute:
```bash
ssh-keygen -t rsa -N ""
for i in {01..19}; do ssh-copy-id cu$i; done
```
2.3 Chrony NTP
- Manager: add allow 192.168.8.0/24 to /etc/chrony.conf.
- Compute: replace the default pool with server 192.168.8.100 iburst (see the sketch below).
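A minimal sketch of those Chrony changes, assuming the stock /etc/chrony.conf shipped with Rocky Linux 9 (the pool line matched by the sed may differ on a customized image):
```bash
# Manager (mu01): allow the cluster subnet to query this NTP server
echo "allow 192.168.8.0/24" >> /etc/chrony.conf
systemctl restart chronyd

# Compute (cu01-cu19): sync from the manager instead of the public pool
sed -i 's/^pool .*/server 192.168.8.100 iburst/' /etc/chrony.conf
systemctl restart chronyd

# Verify the time source on any node
chronyc sources -v
```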
3. Storage & User Management
3.1 NFS
Manager (Server):
```bash
yum install -y nfs-utils
cat >> /etc/exports <<EOF
/home 192.168.8.0/24(rw,sync,no_root_squash)
/opt 192.168.8.0/24(rw,sync,no_root_squash)
EOF
systemctl enable --now nfs-server
```
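A quick check that the exports are live; note that /data, listed as NFS-shared in 1.2, can be exported the same way if it also lives on the manager:
```bash
# On the manager: list what is currently exported
exportfs -v

# From any compute node: confirm the exports are reachable over the network
showmount -e 192.168.8.100
```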
Compute (Client):
```bash
yum install -y nfs-utils
mount 192.168.8.100:/home /home
mount 192.168.8.100:/opt /opt
# Add to /etc/fstab for persistence (see the fstab sketch below)
```
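A sketch of the fstab entries implied by the comment above; the mount options are a common choice, not something this guide prescribes:
```bash
cat >> /etc/fstab <<EOF
192.168.8.100:/home  /home  nfs  defaults,_netdev  0 0
192.168.8.100:/opt   /opt   nfs  defaults,_netdev  0 0
EOF
# Mount everything listed in fstab to confirm the entries are valid
mount -a
```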
3.2 User Sync
For small clusters, sync the account files directly:
```bash
scp /etc/passwd /etc/group /etc/shadow cuXX:/etc/
```
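To push the files to every compute node in one pass, the scp can be wrapped in a loop over the node names defined in 1.1 (a convenience sketch):
```bash
for i in {01..19}; do
  scp /etc/passwd /etc/group /etc/shadow cu$i:/etc/
done
```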
4. MUNGE Authentication
4.1 Install
```bash
yum install -y munge munge-libs munge-devel
```
4.2 Key Distribution
Generate key on Manager:
```bash
create-munge-key
# Sync to all nodes
scp /etc/munge/munge.key root@cuXX:/etc/munge/
```
4.3 Permissions
On ALL nodes:
```bash
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
systemctl enable --now munge
```
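With munge running everywhere, credentials can be checked locally and across nodes from the manager (cu01 stands in for any compute node):
```bash
# Local round trip: encode and decode a credential on the same host
munge -n | unmunge

# Cross-node check: a credential minted on mu01 must decode on a compute node
munge -n | ssh cu01 unmunge
```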
5. Slurm Deployment
Version: Slurm 24.05
5.1 Database (MariaDB)
On Manager:
```bash
yum install -y mariadb-server
systemctl enable --now mariadb
# Create DB 'slurm_acct_db' and user 'slurm' (see the sketch below)
```
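A sketch of the database step from the comment above; the password is a placeholder and must match the StoragePass later set in slurmdbd.conf:
```bash
mysql -u root <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurm_db_pass';
GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
EOF
```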
5.2 Build & Install
```bash
# Build RPMs
rpmbuild -ta slurm-24.05.2.tar.bz2
# Install
# Manager: slurm, slurm-slurmctld, slurm-slurmdbd
# Compute: slurm, slurm-slurmd
```
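rpmbuild needs its build dependencies present and, by default, drops the packages under ~/rpmbuild/RPMS/x86_64/. The sketch below uses one plausible dependency set and installs the packages named in the comments above; rpmbuild will report anything still missing:
```bash
# Common build prerequisites (munge-devel was already installed in section 4)
yum install -y rpm-build gcc make mariadb-devel pam-devel readline-devel perl

# Manager node
yum localinstall -y ~/rpmbuild/RPMS/x86_64/slurm-24.05.2*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmctld-*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmdbd-*.rpm

# Compute nodes
yum localinstall -y ~/rpmbuild/RPMS/x86_64/slurm-24.05.2*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmd-*.rpm
```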
5.3 Configuration
Key files: slurm.conf, slurmdbd.conf, cgroup.conf (minimal sketches of the latter two follow the sample below).
Sample slurm.conf:
```ini
ClusterName=hpccluster
SlurmctldHost=mu01
MpiDefault=none
ProctrackType=proctrack/cgroup
SlurmUser=slurm
# Nodes
NodeName=cu[01-19] CPUs=24 RealMemory=94000 State=UNKNOWN
PartitionName=debug Nodes=cu[01-19] Default=YES MaxTime=INFINITE State=UP
```
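Minimal sketches of the other two key files, with placeholder values to adapt; if job accounting is wanted, slurm.conf additionally needs AccountingStorageType=accounting_storage/slurmdbd and AccountingStorageHost=mu01:
```bash
# /etc/slurm/slurmdbd.conf on mu01 (must be owned by slurm and mode 600)
cat > /etc/slurm/slurmdbd.conf <<EOF
DbdHost=mu01
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurm_db_pass
StorageLoc=slurm_acct_db
EOF
chown slurm:slurm /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf

# /etc/slurm/cgroup.conf on ALL nodes (required by proctrack/cgroup)
cat > /etc/slurm/cgroup.conf <<EOF
ConstrainCores=yes
ConstrainRAMSpace=yes
EOF
```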
5.4 Start Services
Manager:
```bash
systemctl enable --now slurmdbd
systemctl enable --now slurmctld
```
Compute:
```bash
systemctl enable --now slurmd
```
Verify: sinfo should show the nodes as idle.
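Beyond sinfo, a throwaway job exercises scheduling and MUNGE end to end:
```bash
sinfo                      # all cu nodes should appear as idle in the debug partition
srun -N 2 hostname         # run 'hostname' on two nodes through Slurm
scontrol show node cu01    # detailed state of a single node
```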
6. Monitoring (Prometheus + Grafana)
- Node Exporter: Run on all nodes to export metrics.
- Prometheus: Scrape metrics from all nodes (a minimal scrape-config sketch follows this list).
- Grafana: Visualize cluster status.
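A sketch of the wiring, assuming node_exporter on its default port 9100 and Prometheus running on mu01; binary paths and the exact prometheus.yml layout depend on how the tools were installed. Grafana then only needs Prometheus added as a data source.
```bash
# On every node: start node_exporter (listens on :9100 by default).
# The path is an assumption; adjust to wherever the binary was unpacked.
nohup /opt/node_exporter/node_exporter >/var/log/node_exporter.log 2>&1 &

# On mu01: print the scrape job to merge by hand into the
# scrape_configs: section of prometheus.yml
cat <<'EOF'
  - job_name: 'hpc_nodes'
    static_configs:
      - targets: ['192.168.8.100:9100', '192.168.8.101:9100']   # ...and cu02-cu19 likewise
EOF
```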
7. Summary
A complete HPC cluster is now ready:
- Compute: Managed by Slurm.
- Storage: Shared via NFS.
- Monitor: Full stack visibility.
