Deep Dive: Enterprise BeeGFS Deployment & Tuning Guide
Abstract: BeeGFS (formerly FhGFS) is a parallel file system widely used in High-Performance Computing (HPC). Compared to Lustre, it is lighter-weight, easier to manage, and handles concurrent small-file workloads well. Based on large-scale production delivery experience, this document details how to build a BeeGFS cluster supporting PB-scale capacity and Tbps-scale throughput on standard x86 servers.
1. Architecture Design Philosophy
When designing high-performance storage systems, we are not just installing software; we are designing the path of data flow.
1.1 Core Component Logic
BeeGFS adopts a decoupled architecture, mainly consisting of four services:
- Management Service (Mgmtd):
- Role: The "Registry Center" of the cluster, maintaining the status and ID mapping of all service nodes.
- Characteristic: Extremely low load, but critical. If it goes down, the cluster cannot accept new connections (existing connections may persist).
- Metadata Service (Meta):
- Role: Stores the directory tree, permissions, attributes, and data stripe location information.
- Bottleneck: In LOSF (Lots Of Small Files) scenarios, Meta IOPS is the core bottleneck. NVMe SSDs are strongly recommended.
- Storage Service (Storage):
- Role: Stores actual data chunks.
- Strategy: Data is sliced into fixed-size chunks (512KB by default) and distributed across different Storage Targets (see the striping example after this list).
- Client Service (Client):
- Role: Runs on compute nodes, loaded as a kernel module, mapping distributed storage resources to a local POSIX mount point.
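As a concrete illustration of the chunk/stripe behavior described above, the stripe layout of any file or directory can be inspected and changed from a client with `beegfs-ctl`; the paths below are illustrative, not part of the deployment that follows:

```bash
# Show chunk size, number of targets and the target IDs used for an existing file
beegfs-ctl --getentryinfo /mnt/beegfs/dataset/sample.bin

# New files created in this directory will use 1 MB chunks spread across 4 targets
beegfs-ctl --setpattern --chunksize=1m --numtargets=4 /mnt/beegfs/dataset
```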
1.2 Advanced Architecture: Single-Service Multi-Instance (Multi-Mode)
In modern servers, a single process often cannot saturate PCIe 4.0/5.0 or 100Gb+ network bandwidth. To maximize performance, we recommend the "Single Node Multi-Instance" deployment mode:
- Principle: Start multiple `beegfs-meta` or `beegfs-storage` processes on the same physical machine.
- Advantages:
- NUMA Affinity: Bind different instances to different CPU NUMA nodes to reduce cross-socket memory access (a systemd drop-in sketch follows after this list).
- Concurrent Queues: Increase the processing queues for network requests to saturate network card bandwidth.
- Planning Example: A server configured with 2 NVMe drives for Meta and 2 RAID6 groups for Storage. Deploy 2 Meta instances and 2 Storage instances.
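One way to realize the NUMA affinity above is a systemd drop-in that prefixes each instance's daemon with `numactl`. This is a minimal sketch, not the stock unit file: the ExecStart shown is illustrative and should be copied from `systemctl cat beegfs-meta@meta01` on your system, with only the numactl prefix added.

```bash
# Inspect the NUMA topology first
numactl --hardware

# Drop-in that pins Meta instance 1 to NUMA node 0 (CPUs and memory)
mkdir -p /etc/systemd/system/beegfs-meta@meta01.service.d
cat > /etc/systemd/system/beegfs-meta@meta01.service.d/numa.conf <<'EOF'
[Service]
ExecStart=
# Illustrative: copy the real ExecStart from "systemctl cat beegfs-meta@meta01"
# and prepend the numactl call.
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /opt/beegfs/sbin/beegfs-meta cfgFile=/etc/beegfs/meta01.d/beegfs-meta.conf
EOF
systemctl daemon-reload && systemctl restart beegfs-meta@meta01
```

The second Meta instance and the two Storage instances get equivalent drop-ins bound to the other NUMA node.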
2. Infrastructure & Environment Preparation
2.1 Hardware Selection Suggestions
- Metadata Node (MDS):
- CPU: High frequency, fewer cores (Meta operations are sensitive to single-core frequency).
- Disk: Must use SSD/NVMe. RAID1 for OS, RAID1/10 for Meta data.
- Storage Node (OSS):
- CPU: Many cores (to handle massive concurrent I/O requests).
- Disk: Large capacity HDD (RAID6 10+2 or 16+2) or All-Flash. RAID controller must have Super Capacitor, Cache policy set to `Always Write Back`.
- Network:
- Management Plane: 1GbE/10GbE TCP.
- Data Plane: InfiniBand (EDR/HDR/NDR) or RoCEv2 (100G/200G/400G).
2.2 OS Tuning (Critical)
Execute the following operations on all storage nodes to reduce system jitter.
1. Disable System Interference
# Stop Firewall and NetworkManager
systemctl stop firewalld && systemctl disable firewalld
systemctl stop NetworkManager && systemctl disable NetworkManager
# Disable SELinux
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
setenforce 0
2. I/O Scheduler Optimization
For SSD/NVMe, use `noop` or `none`; for HDD RAID, use `deadline` (or `mq-deadline` on blk-mq kernels). A persistent udev rule sketch follows after step 4.
# Example: Set sdb (SSD) to none
echo none > /sys/block/sdb/queue/scheduler
3. Virtual Memory Parameters
Lower the kernel's tendency to swap out process memory so that it reclaims page cache first.
sysctl -w vm.swappiness=1
sysctl -w vm.vfs_cache_pressure=100
echo "vm.swappiness=1" >> /etc/sysctl.conf4. Prepare Yum Repos Ensure `beegfs-mgmtd`, `beegfs-meta`, `beegfs-storage`, `beegfs-client`, `beegfs-helperd`, `beegfs-utils` packages are installed.
3. Deployment: Multi-Instance Mode
Assume the physical machine hostname is `storage01`, planned as follows:
- `/dev/nvme0n1` (2TB): 10GB for Mgmtd, remainder for Meta Instance 1.
- `/dev/nvme1n1` (2TB): All for Meta Instance 2.
- `/dev/sdc` (RAID6): For Storage Instance 1.
- `/dev/sdd` (RAID6): For Storage Instance 2.
3.1 Management Service (Mgmtd) Deployment
# 1. Partition, format and mount (10GB for Mgmtd; the remainder becomes nvme0n1p2 for Meta Instance 1)
parted -s /dev/nvme0n1 mklabel gpt mkpart primary 0% 10GB mkpart primary 10GB 100%
mkfs.ext4 /dev/nvme0n1p1
mkdir -p /data/beegfs/mgmtd
mount /dev/nvme0n1p1 /data/beegfs/mgmtd
# 2. Initialize Service
/opt/beegfs/sbin/beegfs-setup-mgmtd -p /data/beegfs/mgmtd -f
# 3. Start and Enable
systemctl enable beegfs-mgmtd --now
3.2 Metadata Service (Meta) - Multi-Instance
Instance 1 (Meta01):
# 1. Format (Recommend ext4 for small file performance, large inode)
mkfs.ext4 -i 2048 -I 512 -J size=400 -Odir_index,filetype /dev/nvme0n1p2
mkdir -p /data/beegfs/meta01
mount -o noatime,nodiratime,nobarrier /dev/nvme0n1p2 /data/beegfs/meta01
# 2. Initialize (Specify ServiceID=1, Port=8200)
/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta01 -s 1 -S meta01 -m YOUR_MGMT_IP -f
# 3. Modify Port (Critical: Avoid Conflict)
sed -i 's/connMetaPortTCP = 8005/connMetaPortTCP = 8200/g' /etc/beegfs/meta01.d/beegfs-meta.conf
sed -i 's/connMetaPortUDP = 8005/connMetaPortUDP = 8200/g' /etc/beegfs/meta01.d/beegfs-meta.conf
# 4. Start
systemctl enable beegfs-meta@meta01 --now
Instance 2 (Meta02): Repeat the same steps using `/dev/nvme1n1`, mount point `/data/beegfs/meta02`, ServiceID=`2`, Port=`8201`.
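Note that the mount commands above do not survive a reboot. A minimal /etc/fstab sketch for the Mgmtd and Meta volumes, with device names, mount points, and options taken from this guide's example (UUID= entries are safer in production; the storage volumes in 3.3 need equivalent entries):

```bash
# /etc/fstab - example entries for the volumes created above
/dev/nvme0n1p1  /data/beegfs/mgmtd   ext4  defaults,noatime              0 0
/dev/nvme0n1p2  /data/beegfs/meta01  ext4  noatime,nodiratime,nobarrier  0 0
/dev/nvme1n1    /data/beegfs/meta02  ext4  noatime,nodiratime,nobarrier  0 0
```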
3.3 Storage Service (Storage) - Multi-Instance
XFS is recommended for data disks due to better performance with large files and parallel I/O.
Instance 1 (Stor01):
# 1. Format XFS (align to RAID geometry; assume a 128k stripe unit and 10 data disks)
mkfs.xfs -f -d su=128k,sw=10 -l version=2,su=128k -i size=512 /dev/sdc
# 2. Mount (High Performance Params)
mkdir -p /data/beegfs/stor01
mount -o noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k /dev/sdc /data/beegfs/stor01
# 3. Initialize (TargetID=101, Port=8300)
/opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/stor01 -s 1 -S stor01 -i 101 -m YOUR_MGMT_IP -f
# 4. Modify Port
sed -i 's/connStoragePortTCP = 8003/connStoragePortTCP = 8300/g' /etc/beegfs/stor01.d/beegfs-storage.conf
sed -i 's/connStoragePortUDP = 8003/connStoragePortUDP = 8300/g' /etc/beegfs/stor01.d/beegfs-storage.conf
# 5. Start
systemctl enable beegfs-storage@stor01 --now
Instance 2 (Stor02): Repeat the same steps using `/dev/sdd`, mount point `/data/beegfs/stor02`, ServiceID=`2`, TargetID=`201`, Port=`8301`.
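With both storage instances started, it is worth confirming that every service and target has registered with the management daemon before moving on to the clients. A quick check, run from a node whose `/etc/beegfs/beegfs-client.conf` points at the management host (output format varies between versions):

```bash
# Registered metadata and storage services
beegfs-ctl --listnodes --nodetype=meta
beegfs-ctl --listnodes --nodetype=storage

# Target IDs, their owning storage services, and reachability/consistency state
beegfs-ctl --listtargets --nodetype=storage --state
```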
4. High-Performance Client Mounting
Client performance directly impacts AI training efficiency.
4.1 Enable RDMA (InfiniBand/RoCE)
Default installation only supports TCP. For IB environments, client modules must be rebuilt.
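Before touching the autobuild configuration, it helps to confirm that the OFED kernel headers are actually present at the path referenced below; a quick sanity check, assuming the MLNX_OFED default location:

```bash
ls /usr/src/ofa_kernel/default/include/rdma/ib_verbs.h \
  || echo "OFED headers not found - install MLNX_OFED or adjust OFED_INCLUDE_PATH"
```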
Edit Autobuild Config:
vi /etc/beegfs/beegfs-client-autobuild.conf
# Modify as follows:
buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1 OFED_INCLUDE_PATH=/usr/src/ofa_kernel/default/include/
Execute Rebuild:
/etc/init.d/beegfs-client rebuild
4.2 Mount & Verify
# Initialize
/opt/beegfs/sbin/beegfs-setup-client -m YOUR_MGMT_IP
# Start Services
systemctl start beegfs-helperd
systemctl start beegfs-client
# Check Connection Topology
beegfs-net
5. Advanced Configuration
5.1 Buddy Mirror (High Availability)
BeeGFS Buddy Mirroring provides software-based data redundancy (comparable to RAID10 across servers): data remains available even if one target in a buddy group fails completely.
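Once the buddy groups from the steps below have been created, the pairing can be double-checked from an admin node; a small verification sketch (the group and target IDs will match the examples used in this guide):

```bash
# Metadata and storage buddy groups with their primary/secondary members
beegfs-ctl --listmirrorgroups --nodetype=meta
beegfs-ctl --listmirrorgroups --nodetype=storage
```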
Metadata Mirror (Meta Mirror):
# 1. Stop all Clients
systemctl stop beegfs-client
# 2. Create Mirror Group (Automatic Pairing)
beegfs-ctl --addmirrorgroup --automatic --nodetype=meta
# 3. Activate Mirror
beegfs-ctl --mirrormd
# 4. Restart Meta Services
systemctl restart beegfs-meta@meta01
systemctl restart beegfs-meta@meta02
Storage Mirror: Flexible; it can be enabled for specific directories only.
# 1. Create Mirror Group (Pair ID 101 and 201)
beegfs-ctl --addmirrorgroup --nodetype=storage --primary=101 --secondary=201
# 2. Enable Mirror for Critical Directory
beegfs-ctl --setpattern --pattern=buddymirror /mnt/beegfs/critical_data
5.2 Quota Management
Prevents a single user from filling up the entire storage pool.
Steps:
- Server: Set `quotaEnableEnforcement = true` in Mgmt/Meta/Storage configs.
- Storage Mount: Add mount options `uqnoenforce,gqnoenforce`.
- Client: Set `quotaEnabled = true` in `beegfs-client.conf`.
- Init: Run `beegfs-fsck --enablequota`.
- Set Limit (a usage check follows below):
beegfs-ctl --setquota --uid user1 --sizelimit=10T --inodelimit=1000000
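To confirm that the limit is in effect and to watch consumption over time, quota usage can be queried per user; for example (user name taken from the command above):

```bash
# Current space/inode usage and limits for a single user
beegfs-ctl --getquota --uid user1
```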
5.3 BeeOND (On-Demand Burst Buffer)
AI training generates large numbers of small random I/Os. BeeOND (BeeGFS On Demand) uses the compute nodes' local memory or NVMe drives to build a temporary BeeGFS instance that serves as a burst buffer.
Start Command:
# Build temp FS on node01-node10 using /local/nvme
beeond start -n nodefile -d /local/nvme -c /mnt/beeond -P
Data Warm-up:
beeond-cp copy -n nodefile /mnt/beegfs/imagenet /mnt/beeond/imagenet
6. Operations & Troubleshooting
6.1 Quick Command Reference
| Command | Function |
|---|---|
| `beegfs-df` | Check Target capacity and inode usage |
| `beegfs-ctl --listtargets --state` | Check Target online status |
| `beegfs-check-servers` | Check connectivity of all services |
| `beegfs-net` | View currently established RDMA/TCP connections |
| `beegfs-ctl --getentryinfo <file>` | View a file's stripe distribution |
6.2 Common Issues
- Target Offline: Usually caused by disk failure or network interruption. Check beegfs-storage.log. If the disk has been replaced, reuse the original TargetID so the target can be recovered.
- Client Cannot Mount: Check that port 8008 from the client to Mgmtd is open, and check for a kernel version mismatch (which requires rebuilding the client module).
- Low Performance: Check whether TCP is being used instead of RDMA (inspect with beegfs-net), and check for cross-NUMA-node access (see the triage sketch below).
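A minimal triage sequence for the issues above (all commands ship with `beegfs-utils` or the base OS; interpreting the output depends on your topology):

```bash
# 1. Are all storage targets online and consistent?
beegfs-ctl --listtargets --nodetype=storage --state

# 2. Is the client really using RDMA? Each connection should show RDMA, not TCP
beegfs-net

# 3. Can every registered service be reached?
beegfs-check-servers

# 4. NUMA layout on the server, to spot cross-socket placement
numactl --hardware
```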
