
Deep Dive: Enterprise BeeGFS Deployment & Tuning Guide

Abstract: BeeGFS (formerly FhGFS) is a widely used parallel file system in the High-Performance Computing (HPC) domain. Compared to Lustre, it offers significant advantages in lightweight architecture, ease of management, and concurrency handling for small files. This document, based on large-scale production delivery experience, details how to build a BeeGFS cluster supporting PB-scale capacity and Tbps-scale throughput on standard x86 servers.

1. Architecture Design Philosophy

When designing high-performance storage systems, we are not just installing software; we are designing the path of data flow.

1.1 Core Component Logic

BeeGFS adopts a decoupled architecture, mainly consisting of four services:

  1. Management Service (Mgmtd):
    • Role: The "Registry Center" of the cluster, maintaining the status and ID mapping of all service nodes.
    • Characteristic: Extremely low load, but critical. If it goes down, the cluster cannot accept new connections (existing connections may persist).
  2. Metadata Service (Meta):
    • Role: Stores the directory tree, permissions, attributes, and data stripe location information.
    • Bottleneck: In Lots-of-Small-Files (LOSF) scenarios, metadata IOPS is the core bottleneck. NVMe SSDs are strongly recommended.
  3. Storage Service (Storage):
    • Role: Stores actual data chunks.
    • Strategy: Data is sliced into fixed-size chunks (default 512KB) and distributed across different Storage Targets; the stripe pattern can be tuned per directory (see the example after this list).
  4. Client Service (Client):
    • Role: Runs on compute nodes, loaded as a kernel module, mapping distributed storage resources to a local POSIX mount point.
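The stripe pattern is configurable per directory from any mounted client. A minimal sketch (the directory path is illustrative):

bash
# Show the current stripe pattern (chunk size, number of targets) of a directory
beegfs-ctl --getentryinfo /mnt/beegfs/dataset

# New files in this directory: 1MB chunks striped across 4 targets
beegfs-ctl --setpattern --chunksize=1m --numtargets=4 /mnt/beegfs/dataset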

1.2 Advanced Architecture: Single-Service Multi-Instance (Multi-Mode)

In modern servers, a single process often cannot saturate PCIe 4.0/5.0 or 100Gb+ network bandwidth. To maximize performance, we recommend the "Single Node Multi-Instance" deployment mode:

  • Principle: Start multiple `beegfs-meta` or `beegfs-storage` processes on the same physical machine.
  • Advantages:
    • NUMA Affinity: Bind different instances to different CPU NUMA nodes to reduce cross-socket memory access (see the systemd pinning sketch after this list).
    • Concurrent Queues: Increase the processing queues for network requests to saturate network card bandwidth.
  • Planning Example: A server configured with 2 NVMe drives for Meta and 2 RAID6 groups for Storage. Deploy 2 Meta instances and 2 Storage instances.
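One way to implement the NUMA affinity mentioned above is a per-instance systemd drop-in. The sketch below assumes the drives and NIC used by stor01 sit on NUMA node 0; the CPU range and instance name are illustrative:

bash
# Illustrative drop-in: pin the stor01 instance to NUMA node 0
mkdir -p /etc/systemd/system/beegfs-storage@stor01.service.d
cat > /etc/systemd/system/beegfs-storage@stor01.service.d/numa.conf <<'EOF'
[Service]
CPUAffinity=0-15
# NUMAPolicy/NUMAMask require systemd >= 243; otherwise wrap ExecStart in numactl --cpunodebind=0 --membind=0
NUMAPolicy=bind
NUMAMask=0
EOF
systemctl daemon-reload
systemctl restart beegfs-storage@stor01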

2. Infrastructure & Environment Preparation

2.1 Hardware Selection Suggestions

  • Metadata Node (MDS):
    • CPU: High frequency, fewer cores (Meta operations are sensitive to single-core frequency).
    • Disk: Must use SSD/NVMe. RAID1 for OS, RAID1/10 for Meta data.
  • Storage Node (OSS):
    • CPU: Many cores (to handle massive concurrent I/O requests).
    • Disk: Large capacity HDD (RAID6 10+2 or 16+2) or All-Flash. RAID controller must have Super Capacitor, Cache policy set to `Always Write Back`.
  • Network:
    • Management Plane: 1GbE/10GbE TCP.
    • Data Plane: InfiniBand (EDR/HDR/NDR) or RoCEv2 (100G/200G/400G).

2.2 OS Tuning (Critical)

Execute the following operations on all storage nodes to reduce system jitter.

1. Disable System Interference

bash
# Stop Firewall and NetworkManager
systemctl stop firewalld && systemctl disable firewalld
systemctl stop NetworkManager && systemctl disable NetworkManager

# Disable SELinux
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
setenforce 0

2. I/O Scheduler Optimization. For SSD/NVMe, use `noop` or `none`; for HDD RAID volumes, use `deadline` (`mq-deadline` on blk-mq kernels).

bash
# Example: Set sdb (SSD) to none
echo none > /sys/block/sdb/queue/scheduler
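The echo above does not survive a reboot. A udev rule can re-apply the choice automatically; the device matches below are illustrative and should be adapted to the actual drive letters:

bash
# Persist scheduler choices across reboots
cat > /etc/udev/rules.d/60-io-scheduler.rules <<'EOF'
ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
EOF
udevadm control --reload && udevadm trigger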

3. Virtual Memory Parameters. Lower the swapping tendency so the kernel does not swap out service memory under page-cache pressure.

bash
sysctl -w vm.swappiness=1
sysctl -w vm.vfs_cache_pressure=100
echo "vm.swappiness=1" >> /etc/sysctl.conf

4. Prepare Package Repositories. Ensure the `beegfs-mgmtd`, `beegfs-meta`, `beegfs-storage`, `beegfs-client`, `beegfs-helperd`, and `beegfs-utils` packages are installed on the appropriate nodes; a sketch follows.
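A minimal sketch of adding the official repository and installing the packages (the release version and distro in the URL are illustrative; match them to your environment):

bash
# Add the BeeGFS yum repo (adjust release/distro in the URL)
wget -O /etc/yum.repos.d/beegfs.repo https://www.beegfs.io/release/beegfs_7.3.4/dists/beegfs-rhel8.repo

# Server nodes
yum install -y beegfs-mgmtd beegfs-meta beegfs-storage beegfs-utils

# Compute/client nodes
yum install -y beegfs-client beegfs-helperd beegfs-utils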


3. Deployment: Multi-Instance Mode

Assume the physical machine hostname is `storage01`, planned as follows:

  • `/dev/nvme0n1` (2TB): 10GB for Mgmtd, remainder for Meta Instance 1.
  • `/dev/nvme1n1` (2TB): All for Meta Instance 2.
  • `/dev/sdc` (RAID6): For Storage Instance 1.
  • `/dev/sdd` (RAID6): For Storage Instance 2.

3.1 Management Service (Mgmtd) Deployment

bash
# 1. Partition (10GB for Mgmtd, remainder for Meta Instance 1), format, and mount
parted -s /dev/nvme0n1 mklabel gpt mkpart primary 0% 10GB mkpart primary 10GB 100%
mkfs.ext4 /dev/nvme0n1p1
mkdir -p /data/beegfs/mgmtd
mount /dev/nvme0n1p1 /data/beegfs/mgmtd

# 2. Initialize Service
/opt/beegfs/sbin/beegfs-setup-mgmtd -p /data/beegfs/mgmtd -f

# 3. Start and Enable
systemctl enable beegfs-mgmtd --now
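A quick sanity check that the management daemon is up and listening on its default port (8008) before moving on:

bash
systemctl status beegfs-mgmtd --no-pager
ss -lntup | grep 8008          # Mgmtd listens on TCP/UDP 8008 by default
tail -n 20 /var/log/beegfs-mgmtd.log   # default logStdFile location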

3.2 Metadata Service (Meta) - Multi-Instance

Instance 1 (Meta01):

bash
# 1. Format (Recommend ext4 for small file performance, large inode)
mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/nvme0n1p2
mkdir -p /data/beegfs/meta01
mount -o noatime,nodiratime,nobarrier /dev/nvme0n1p2 /data/beegfs/meta01

# 2. Initialize (Specify ServiceID=1, Port=8200)
/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta01 -s 1 -S meta01 -m YOUR_MGMT_IP -f

# 3. Modify Port (Critical: Avoid Conflict)
sed -i 's/^connMetaPortTCP.*/connMetaPortTCP = 8200/' /etc/beegfs/meta01.d/beegfs-meta.conf
sed -i 's/^connMetaPortUDP.*/connMetaPortUDP = 8200/' /etc/beegfs/meta01.d/beegfs-meta.conf

# 4. Start
systemctl enable beegfs-meta@meta01 --now

Instance 2 (Meta02): Repeat steps using `/dev/nvme1n1`, mount point `/data/beegfs/meta02`, ServiceID=`2`, Port=`8201`.
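For reference, a condensed sketch of the Meta02 steps under the plan above (same flow as Instance 1; only the device, ServiceID, and port change):

bash
mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/nvme1n1
mkdir -p /data/beegfs/meta02
mount -o noatime,nodiratime,nobarrier /dev/nvme1n1 /data/beegfs/meta02

/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta02 -s 2 -S meta02 -m YOUR_MGMT_IP -f
sed -i 's/^connMetaPortTCP.*/connMetaPortTCP = 8201/' /etc/beegfs/meta02.d/beegfs-meta.conf
sed -i 's/^connMetaPortUDP.*/connMetaPortUDP = 8201/' /etc/beegfs/meta02.d/beegfs-meta.conf
systemctl enable beegfs-meta@meta02 --now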

3.3 Storage Service (Storage) - Multi-Instance

XFS is recommended for data disks due to better performance with large files and parallel I/O.

Instance 1 (Stor01):

bash
# 1. Format XFS (align to the RAID geometry; assume stripe unit 128k, 10 data disks)
mkfs.xfs -f -d su=128k,sw=10 -l version=2,su=128k -i size=512 /dev/sdc

# 2. Mount (high-performance params; drop 'nobarrier' on kernels >= 4.19, where XFS no longer accepts it)
mkdir -p /data/beegfs/stor01
mount -o noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k /dev/sdc /data/beegfs/stor01

# 3. Initialize (TargetID=101, Port=8300)
/opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/stor01 -s 1 -S stor01 -i 101 -m YOUR_MGMT_IP -f

# 4. Modify Port
sed -i 's/^connStoragePortTCP.*/connStoragePortTCP = 8300/' /etc/beegfs/stor01.d/beegfs-storage.conf
sed -i 's/^connStoragePortUDP.*/connStoragePortUDP = 8300/' /etc/beegfs/stor01.d/beegfs-storage.conf

# 5. Start
systemctl enable beegfs-storage@stor01 --now

Instance 2 (Stor02): Repeat steps using `/dev/sdd`, mount point `/data/beegfs/stor02`, ServiceID=`2`, TargetID=`201`, Port=`8301`.
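With both targets registered, the topology can be verified from any node that has beegfs-utils installed:

bash
# All registered services should respond
beegfs-check-servers

# Both storage targets (101, 201) should be Online with Good consistency
beegfs-ctl --listtargets --nodetype=storage --state

# Per-target capacity and inode usage
beegfs-df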


4. High-Performance Client Mounting

Client performance directly impacts AI training efficiency.

4.1 Enable RDMA (InfiniBand/RoCE)

Default installation only supports TCP. For IB environments, client modules must be rebuilt.

Edit Autobuild Config:

bash
vi /etc/beegfs/beegfs-client-autobuild.conf
# Modify as follows:
buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include/

Execute Rebuild:

bash
/etc/init.d/beegfs-client rebuild

4.2 Mount & Verify

bash
# Initialize
/opt/beegfs/sbin/beegfs-setup-client -m YOUR_MGMT_IP

# Start Services
systemctl start beegfs-helperd
systemctl start beegfs-client

# Check Connection Topology
beegfs-net
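A simple smoke test after mounting; the default mount point /mnt/beegfs and the dd parameters are illustrative:

bash
# Write a test file and confirm it is striped across the expected targets
dd if=/dev/zero of=/mnt/beegfs/ddtest.bin bs=1M count=4096 oflag=direct
beegfs-ctl --getentryinfo /mnt/beegfs/ddtest.bin
beegfs-df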

5. Advanced Configuration

5.1 Buddy Mirror (High Availability)

BeeGFS Buddy Mirror provides software-based data redundancy (comparable to RAID10 across targets). Even if a Storage Target fails completely, the data remains accessible from its buddy.

Metadata Mirror (Meta Mirror):

bash
# 1. Stop all Clients
systemctl stop beegfs-client

# 2. Create Mirror Group (Automatic Pairing)
beegfs-ctl --addmirrorgroup --automatic --nodetype=meta

# 3. Activate Mirror
beegfs-ctl --mirrormd

# 4. Restart Meta Services
systemctl restart beegfs-meta@meta01
systemctl restart beegfs-meta@meta02

Storage Mirror: Flexible; can be enabled for specific directories.

bash
# 1. Create Mirror Group (Pair ID 101 and 201)
beegfs-ctl --addmirrorgroup --nodetype=storage --primary=101 --secondary=201

# 2. Enable Mirror for Critical Directory
beegfs-ctl --setpattern --pattern=buddymirror /mnt/beegfs/critical_data
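The buddy groups and the initial resync can be checked afterwards (group and target IDs depend on what --addmirrorgroup reported):

bash
# List buddy groups for both node types
beegfs-ctl --listmirrorgroups --nodetype=meta
beegfs-ctl --listmirrorgroups --nodetype=storage

# Targets should return to "Good" consistency once the resync finishes
beegfs-ctl --listtargets --nodetype=storage --state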

5.2 Quota Management

Prevents a single user from filling up the entire storage pool.

Steps:

  1. Server: Set `quotaEnableEnforcement = true` in Mgmt/Meta/Storage configs.
  2. Storage Mount: Add mount options `uqnoenforce,gqnoenforce`.
  3. Client: Set `quotaEnabled = true` in `beegfs-client.conf`.
  4. Init: Run `beegfs-fsck --enablequota`.
  5. Set Limit:
    bash
    beegfs-ctl --setquota --uid user1 --sizelimit=10T --inodelimit=1000000
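Current usage and the configured limits can then be read back per user or group (the group name is illustrative):

bash
beegfs-ctl --getquota --uid user1
beegfs-ctl --getquota --gid ai_team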

5.3 BeeOND (On-Demand Burst Buffer)

AI training workloads generate large volumes of random small I/O. BeeOND uses the compute nodes' local memory or NVMe drives to build a temporary BeeGFS instance that acts as a Burst Buffer.

Start Command:

bash
# Build temp FS on node01-node10 using /local/nvme
beeond start -n nodefile -d /local/nvme -c /mnt/beeond -P

Data Warm-up:

bash
beeond-cp copy -n nodefile /mnt/beegfs/imagenet /mnt/beeond/imagenet
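When the job completes, results should be staged back and the on-demand file system torn down. A sketch; the stop flags below are from memory and worth verifying against `beeond help` for your version:

bash
# Copy results back to persistent BeeGFS before teardown
beeond-cp copy -n nodefile /mnt/beeond/checkpoints /mnt/beegfs/checkpoints

# Stop BeeOND; -L unmounts on the local node, -d deletes the temporary data directories
beeond stop -n nodefile -L -d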

6. Operations & Troubleshooting

6.1 Quick Command Reference

Command | Function
--- | ---
`beegfs-df` | Check Target capacity and inode usage
`beegfs-ctl --listtargets --state` | Check Target online status
`beegfs-check-servers` | Check connectivity of all services
`beegfs-net` | View currently established RDMA/TCP connections
`beegfs-ctl --getentryinfo <file>` | View file stripe distribution info

6.2 Common Issues

  • Target Offline: Usually due to disk failure or network interruption. Check beegfs-storage.log. If disk is replaced, reuse the original TargetID for recovery.
  • Client Cannot Mount: Check that the client can reach the Mgmtd on port 8008, and check for a kernel version mismatch (the client module must be rebuilt after a kernel upgrade).
  • Low Performance: Check whether connections fell back to TCP instead of RDMA (verify with beegfs-net); check for cross-NUMA-node memory access.

AI-HPC Organization