AI-HPC Knowledge Overview
This documentation builds a vertically integrated knowledge graph that connects the underlying High-Performance Computing (HPC) infrastructure to the Artificial Intelligence (AI) applications running on top of it.
🏗️ Part 1: Infrastructure
Goal: Build a computing base with high bandwidth, low latency, and massive parallelism.
- 01. Hardware & Chips: GPU (H100/A100), NPU, Heterogeneous Architectures.
- 02. Cluster Architecture: NVIDIA SuperPod, HPL/STREAM Benchmarking, Topology (see the STREAM-style probe after this list).
- 03. High-Performance Network: InfiniBand, RoCE v2, NCCL Optimization (see the all-reduce sketch after this list).
- 04. Parallel Storage: Lustre, GPUDirect Storage (GDS), High-Concurrency I/O.
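The chapters above are best grounded with small, runnable probes. As a taste of the STREAM benchmarking in chapter 02, here is a triad-style memory-bandwidth probe written in PyTorch rather than the classic C benchmark; the array size and iteration count are illustrative choices, not prescribed values.

```python
# STREAM-style "triad" bandwidth probe: a = b + alpha * c.
# A single fused kernel reads b and c and writes a, i.e. 3 float32 accesses per element.
import time

import torch

n = 1 << 26                            # ~64M float32 elements (~256 MiB per array)
a = torch.empty(n, device="cuda")
b = torch.rand(n, device="cuda")
c = torch.rand(n, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
iters = 10
for _ in range(iters):
    torch.add(b, c, alpha=3.0, out=a)  # triad: a = b + 3*c, one kernel launch
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

moved_gb = 3 * n * 4 * iters / 1e9     # bytes read + written, in GB
print(f"effective bandwidth: ~{moved_gb / elapsed:.0f} GB/s")
```

A result far below the GPU's HBM spec usually points at a configuration problem rather than a hardware one.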
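For the NCCL optimization topic in chapter 03, the sketch below shows the smallest useful NCCL workload: a SUM all-reduce through torch.distributed. The launch command and tensor contents are illustrative; on a real cluster you would also set NCCL environment variables (e.g. NCCL_DEBUG=INFO) to verify that traffic actually goes over InfiniBand.

```python
# Minimal all-reduce over the NCCL backend.
# Launch with: torchrun --nproc_per_node=2 allreduce_demo.py  (filename illustrative)
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL sums them in place across all GPUs.
    x = torch.ones(4, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```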
🖥️ Part 2: System Software
Goal: Improve resource utilization and abstract away underlying hardware differences.
- 05. Automated Provisioning: PXE, Cobbler, Ubuntu Autoinstall.
- 06. Cloud & Scheduling: Kubernetes (Volcano), Slurm, Docker Containerization.
- 07. Heterogeneous Computing: NVIDIA Driver, CUDA Toolkit, MIG (see the NVML query after this list).
- 08. AI Compiler: OpenAI Triton, TVM, Operator Fusion (see the Triton kernel after this list).
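Chapter 07's driver and MIG material can be previewed with NVML, the library behind nvidia-smi. The sketch below uses the pynvml bindings to list devices and their MIG mode; the calls follow the public NVML API, and error handling is trimmed for brevity.

```python
# Enumerate GPUs via NVML and report memory size and MIG mode.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    try:
        current, pending = pynvml.nvmlDeviceGetMigMode(handle)
        mig = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
    except pynvml.NVMLError:
        mig = "unsupported"  # pre-Ampere GPUs have no MIG
    print(f"GPU {i}: {name}, {mem.total / 2**30:.0f} GiB, MIG {mig}")
pynvml.nvmlShutdown()
```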
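For chapter 08, a minimal OpenAI Triton kernel shows what "writing CUDA in Python" looks like before topics such as operator fusion enter the picture. This is the canonical vector-add example; BLOCK_SIZE=1024 is an arbitrary but reasonable tile width.

```python
# Vector addition as a Triton kernel: each program instance handles one tile.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.rand(4096, device="cuda")
    b = torch.rand(4096, device="cuda")
    assert torch.allclose(add(a, b), a + b)
```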
🧠 Part 3: LLM Technology
Goal: Efficient training and inference on massive clusters.
- 09. Deep Learning Frameworks: PyTorch 2.x, DeepSpeed, Megatron-LM.
- 10. Pre-trained Models: Transformer, MoE, Pre-training Data Pipeline.
- 11. Distributed Training: 3D Parallelism, ZeRO, Mixed Precision (see the AMP loop after this list).
- 12. Inference Engines: vLLM (PagedAttention), TensorRT-LLM, Quantization (see the vLLM sketch after this list).
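Mixed precision from chapter 11 fits in a dozen lines of PyTorch: autocast runs matmul-heavy ops in FP16 while GradScaler protects gradients from underflow. The model and loss below are toy stand-ins, not part of the source material.

```python
# Automatic mixed precision training loop (toy model and loss).
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).square().mean()  # stand-in loss
    scaler.scale(loss).backward()  # scale loss so FP16 grads stay representable
    scaler.step(optimizer)         # unscales grads, skips the step on inf/NaN
    scaler.update()                # adjust the scale factor for the next step
```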
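Chapter 12's vLLM can likewise be exercised offline in a few lines; PagedAttention works transparently underneath the generate() call. The model name here is just an example of any HF-compatible checkpoint.

```python
# Offline batch inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any HF-compatible model path works
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The key idea of PagedAttention is"], params)
for out in outputs:
    print(out.outputs[0].text)
```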
🚀 Part 4: Applications & AI4S
Goal: Empower industries and explore new scientific paradigms.
- 13. Industry Applications: RAG, Agents, Private Deployment (see the retrieval sketch after this list).
- 14. AI for Science.
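A minimal retrieval-augmented generation (RAG) loop from chapter 13 needs only an embedding function and cosine similarity. Everything below, including the embed() stand-in and the sample documents, is hypothetical scaffolding rather than a reference implementation.

```python
# Minimal RAG retrieval: embed documents, pick top-k by cosine similarity,
# and build an augmented prompt for the LLM.
import numpy as np

rng = np.random.default_rng(0)
_fake_vectors: dict[str, np.ndarray] = {}

def embed(text: str) -> np.ndarray:
    # Hypothetical: returns a random but stable vector per text.
    # Swap in a real sentence-embedding model (e.g. sentence-transformers).
    if text not in _fake_vectors:
        _fake_vectors[text] = rng.normal(size=384)
    return _fake_vectors[text]

def top_k(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    sims = [float(q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d))))
            for d in docs]
    order = np.argsort(sims)[::-1][:k]   # highest similarity first
    return [docs[i] for i in order]

docs = [
    "Slurm schedules batch jobs across cluster nodes.",
    "NCCL implements collective communication for GPUs.",
    "Lustre is a parallel file system for HPC storage.",
]
question = "How do GPUs exchange gradients?"
context = "\n".join(top_k(question, docs))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```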
