Incident Review: How LD_LIBRARY_PATH Broke dracut and the XFS Root Mount
Abstract
A GPU/HPC server failed to boot with the error: mount: unknown filesystem type 'xfs'. The root cause was not a disk failure but environment-variable pollution: a globally set LD_LIBRARY_PATH broke the initramfs generation process.
1. Symptoms
The system dropped into emergency mode. Manual mounting of XFS partitions worked, but the automated boot process failed to recognize the filesystem type.
2. The Red Flag
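blkid -V reported a version from 2003, even though the system was running a modern CentOS 7. That mismatch was the first sign that blkid was not running against the system's own libraries. Because dracut depends on blkid during filesystem detection, the malfunctioning blkid kept it from recognizing the root partition as XFS, and the resulting initramfs was built without the xfs.ko module.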
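The version check can be automated. The Python sketch below runs blkid -V and extracts the year from the last date it finds in the output. It assumes the output contains dates in dd-Mon-yyyy form (for example 25-Apr-2013), and the cutoff year of 2010 for "suspiciously old" is an arbitrary choice; neither value comes from this incident.

```python
import re
import subprocess
from typing import Optional

# Assumption: blkid -V prints dates in dd-Mon-yyyy form, e.g. "25-Apr-2013".
DATE_RE = re.compile(r"\b\d{1,2}-[A-Z][a-z]{2}-(\d{4})\b")


def reported_year(version_output: str) -> Optional[int]:
    """Return the year of the last dd-Mon-yyyy date in the text, or None."""
    years = DATE_RE.findall(version_output)
    return int(years[-1]) if years else None


def looks_stale(version_output: str, min_year: int = 2010) -> bool:
    """True if the reported date is older than min_year (an arbitrary cutoff)."""
    year = reported_year(version_output)
    return year is not None and year < min_year


def read_blkid_version() -> str:
    """Run `blkid -V` in the current environment and return its combined output."""
    result = subprocess.run(
        ["blkid", "-V"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode; also works on Python 3.6
    )
    return result.stdout
```

Calling looks_stale(read_blkid_version()) once in the affected shell and once after unset LD_LIBRARY_PATH should show whether the environment alone changes the answer.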
3. Root Cause
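LD_LIBRARY_PATH was set globally in /etc/profile to point to the CUDA and MPI library directories. Those directories contained libraries with the same names as system libraries. Because the dynamic linker searches LD_LIBRARY_PATH before the standard system paths, critical binaries such as blkid loaded the incompatible copies and malfunctioned.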
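The overlap described above can be detected before it causes trouble. This Python sketch compares the file names of shared libraries found in each LD_LIBRARY_PATH directory against those in the system library directories. The directory list /lib64 and /usr/lib64 is an assumption for a 64-bit CentOS 7 host, and a matching file name only marks a candidate for shadowing, not proof that the copies are incompatible.

```python
import os
from typing import Dict, Iterable, List, Set

# Assumption: the standard 64-bit system library directories on CentOS 7.
SYSTEM_LIB_DIRS = ("/lib64", "/usr/lib64")


def shared_objects(directory: str) -> Set[str]:
    """Names in `directory` that look like shared libraries (*.so or *.so.N)."""
    try:
        names = os.listdir(directory)
    except OSError:  # missing or unreadable directory
        return set()
    return {n for n in names if n.endswith(".so") or ".so." in n}


def find_shadowed(
    ld_library_path: str, system_dirs: Iterable[str] = SYSTEM_LIB_DIRS
) -> Dict[str, List[str]]:
    """Map each LD_LIBRARY_PATH directory to the system library names it also provides."""
    system_names: Set[str] = set()
    for d in system_dirs:
        system_names |= shared_objects(d)
    report: Dict[str, List[str]] = {}
    for d in ld_library_path.split(":"):
        if not d:  # skip empty entries
            continue
        overlap = sorted(shared_objects(d) & system_names)
        if overlap:
            report[d] = overlap
    return report
```

A typical call is find_shadowed(os.environ.get("LD_LIBRARY_PATH", "")); any directory it reports deserves a closer look before the variable is exported system-wide.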
4. Resolution
- Reinstall util-linux and e2fsprogs via RPM to restore binary integrity.
- Rebuild the initramfs using dracut -f.
- Verify that xfs.ko is present in the image, then reboot.
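The rebuild and verification steps can be scripted. This Python sketch runs dracut -f and then lsinitrd, both with LD_LIBRARY_PATH removed from their environment; stripping the variable is an added precaution, not a step recorded in this incident. It assumes each lsinitrd line ends with the file's path, accepts compressed module names such as xfs.ko.xz, and must run as root.

```python
import os
import subprocess
from typing import Dict, Mapping


def clean_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy of `env` without LD_LIBRARY_PATH, so tools resolve system libraries."""
    return {k: v for k, v in env.items() if k != "LD_LIBRARY_PATH"}


def listing_has_xfs(listing: str) -> bool:
    """True if any line's final path component is xfs.ko, optionally compressed."""
    for line in listing.splitlines():
        name = line.strip().rsplit("/", 1)[-1]
        if name == "xfs.ko" or name.startswith("xfs.ko."):
            return True
    return False


def rebuild_and_verify() -> bool:
    """Rebuild the initramfs for the running kernel and check it contains xfs.ko."""
    env = clean_env(os.environ)
    # dracut -f overwrites the existing image for the running kernel.
    subprocess.run(["dracut", "-f"], env=env, check=True)
    # lsinitrd with no arguments lists the image for the running kernel.
    listing = subprocess.run(
        ["lsinitrd"], env=env, check=True,
        stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    return listing_has_xfs(listing)
```

Rebooting only when rebuild_and_verify() returns True avoids repeating the failed boot with an image that still lacks the module.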
5. Prevention
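Avoid setting LD_LIBRARY_PATH globally. Register library directories with ldconfig, or manage the HPC software stack with Lmod so that CUDA and MPI paths are added only in the sessions that load them.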
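One way to enforce the first half of this advice is a periodic audit of the system-wide shell startup files. The Python sketch below flags lines that assign LD_LIBRARY_PATH in /etc/profile, /etc/bashrc and /etc/profile.d/*.sh. The file list and the matching regex are assumptions, and the check is a heuristic rather than a shell parser: it misses, for example, assignments made through eval or csh-style setenv.

```python
import glob
import re
from typing import List, Tuple

# Heuristic: a line that assigns LD_LIBRARY_PATH, with or without a leading `export`.
ASSIGN_RE = re.compile(r"^\s*(?:export\s+)?LD_LIBRARY_PATH=")


def default_profile_files() -> List[str]:
    """System-wide shell startup files on CentOS 7 (an assumed list)."""
    return ["/etc/profile", "/etc/bashrc"] + sorted(glob.glob("/etc/profile.d/*.sh"))


def find_global_assignments(paths: List[str]) -> List[Tuple[str, int, str]]:
    """Return (file, line number, line) for each LD_LIBRARY_PATH assignment found."""
    hits: List[Tuple[str, int, str]] = []
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if ASSIGN_RE.match(line):
                        hits.append((path, lineno, line.strip()))
        except OSError:  # file missing or unreadable
            continue
    return hits
```

Each hit from find_global_assignments(default_profile_files()) is a candidate for moving into an Lmod module or an ldconfig configuration instead.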
