AI-HPC Knowledge Overview
This documentation builds a vertically integrated knowledge graph that connects the underlying High-Performance Computing (HPC) infrastructure to the Artificial Intelligence (AI) applications running on top of it.
🏗️ Part 1: Infrastructure
Goal: Build a computing base with high bandwidth, low latency, and massive parallelism.
- 01. Hardware & Chips: GPU (H100/A100), NPU, Heterogeneous Architectures.
- 02. Cluster Architecture: NVIDIA SuperPod, HPL/STREAM Benchmarking, Topology (see the STREAM-style probe after this list).
- 03. High-Performance Network: InfiniBand, RoCE v2, NCCL Optimization (see the all-reduce sketch after this list).
- 04. Parallel Storage: Lustre, GPUDirect Storage (GDS), High-Concurrency I/O.
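The chapters above are best grounded with small, runnable probes. As a taste of the STREAM benchmarking in chapter 02, here is a triad-style memory-bandwidth probe written in PyTorch rather than the classic C benchmark; the array size and iteration count are illustrative choices, not prescribed values.

```python
# STREAM-style "triad" bandwidth probe: a = b + alpha * c.
# A single fused kernel reads b and c and writes a, i.e. 3 float32 accesses per element.
import time

import torch

n = 1 << 26                            # ~64M float32 elements (~256 MiB per array)
a = torch.empty(n, device="cuda")
b = torch.rand(n, device="cuda")
c = torch.rand(n, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
iters = 10
for _ in range(iters):
    torch.add(b, c, alpha=3.0, out=a)  # triad: a = b + 3*c, one kernel launch
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

moved_gb = 3 * n * 4 * iters / 1e9     # bytes read + written, in GB
print(f"effective bandwidth: ~{moved_gb / elapsed:.0f} GB/s")
```

A result far below the GPU's HBM spec usually points at a configuration problem rather than a hardware one.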
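For the NCCL optimization topic in chapter 03, the sketch below shows the smallest useful NCCL workload: a SUM all-reduce through torch.distributed. The launch command and tensor contents are illustrative; on a real cluster you would also set NCCL environment variables (e.g. NCCL_DEBUG=INFO) to verify that traffic actually goes over InfiniBand.

```python
# Minimal all-reduce over the NCCL backend.
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py  (filename illustrative)
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL sums them in place across all GPUs.
    x = torch.ones(4, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```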
🖥️ Part 2: System Software
Goal: Improve resource utilization and abstract away underlying hardware differences.
- 05. Automated Provisioning: PXE, Cobbler, Ubuntu Autoinstall.
- 06. Cloud & Scheduling: Kubernetes (Volcano), Slurm, Docker Containerization.
- 07. Heterogeneous Computing: NVIDIA Driver, CUDA Toolkit, MIG (see the NVML query after this list).
- 08. AI Compiler: OpenAI Triton, TVM, Operator Fusion (see the Triton kernel after this list).
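Chapter 07's driver and MIG material can be previewed with NVML, the library behind nvidia-smi. The sketch below uses the pynvml bindings to list devices and their MIG mode; the calls follow the public NVML API, and error handling is trimmed for brevity.

```python
# Enumerate GPUs via NVML and report memory size and MIG mode.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    try:
        current, pending = pynvml.nvmlDeviceGetMigMode(handle)
        mig = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
    except pynvml.NVMLError:
        mig = "unsupported"  # pre-Ampere GPUs have no MIG
    print(f"GPU {i}: {name}, {mem.total / 2**30:.0f} GiB, MIG {mig}")
pynvml.nvmlShutdown()
```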
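For chapter 08, a minimal OpenAI Triton kernel shows what "writing CUDA in Python" looks like before topics such as operator fusion enter the picture. This is the canonical vector-add example; BLOCK_SIZE=1024 is an arbitrary but reasonable tile width.

```python
# Vector addition as a Triton kernel: each program instance handles one tile.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.rand(4096, device="cuda")
    b = torch.rand(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```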
🧠 Part 3: LLM Technology
Goal: Efficient training and inference on massive clusters.
- 09. Deep Learning Frameworks: PyTorch 2.x, DeepSpeed, Megatron-LM.
- 10. Pre-trained Models: Transformer, MoE, Pre-training Data Pipeline.
- 11. Distributed Training: 3D Parallelism, ZeRO, Mixed Precision (see the AMP loop after this list).
- 12. Inference Engines: vLLM (PagedAttention), TensorRT-LLM, Quantization (see the vLLM sketch after this list).
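Mixed precision from chapter 11 fits in a dozen lines of PyTorch: autocast runs matmul-heavy ops in FP16 while GradScaler protects gradients from underflow. The model and loss below are toy stand-ins, not part of the source material.

```python
# Automatic mixed precision training loop (toy model and loss).
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()  # stand-in loss
    scaler.scale(loss).backward()  # scale loss so FP16 grads stay representable
    scaler.step(optimizer)         # unscales grads, skips the step on inf/NaN
    scaler.update()                # adjust the scale factor for the next step
```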
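Chapter 12's vLLM can likewise be exercised offline in a few lines; PagedAttention works transparently underneath the generate() call. The model name here is just an example of any HF-compatible checkpoint.

```python
# Offline batch inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any HF-compatible model path works
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The key idea of PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```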
🚀 Part 4: Applications & AI4S
Goal: Empower industries and explore new scientific paradigms.
- 13. Industry Applications: RAG, Agents, Private Deployment (see the retrieval sketch after this list).
- 14. AI for Science.
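A minimal retrieval-augmented generation (RAG) loop from chapter 13 needs only an embedding function and cosine similarity. Everything below, including the embed() stand-in and the sample documents, is hypothetical scaffolding rather than a reference implementation.

```python
# Minimal RAG retrieval: embed documents, pick top-k by cosine similarity,
# and build an augmented prompt for the LLM.
import numpy as np

rng = np.random.default_rng(0)
_fake_vectors: dict[str, np.ndarray] = {}

def embed(text: str) -> np.ndarray:
    # Hypothetical: returns a random but stable vector per text.
    # Swap in a real sentence-embedding model (e.g. sentence-transformers).
    if text not in _fake_vectors:
        _fake_vectors[text] = rng.normal(size=384)
    return _fake_vectors[text]

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    sims = [float(q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d))))
            for d in docs]
    order = np.argsort(sims)[::-1][:k]   # highest similarity first
    return [docs[i] for i in order]

docs = [
    "Slurm schedules batch jobs across cluster nodes.",
    "NCCL implements collective communication for GPUs.",
    "Lustre is a parallel file system for HPC storage.",
]
question = "How do GPUs exchange gradients?"
context = "\n".join(top_k(question, docs))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```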
