
DeepSeek Architecture: MLA & Extreme MoE Optimization

Abstract: The core reason DeepSeek series models achieve SOTA results at extremely low costs lies in their radical modifications to the Transformer architecture. This article delves into their two killer features: Multi-Head Latent Attention (MLA) and DeepSeekMoE.

1. The Bottlenecks: Memory & Communication

Training trillion-parameter models on massive clusters faces two physical bottlenecks:

  1. KV Cache Memory: Long-sequence inference consumes HBM and limits Batch Size (throughput).
  2. MoE Communication: Traditional MoE (Mixture of Experts) incurs huge cross-node communication overhead (All-to-All) during expert routing.

DeepSeek's architecture is built specifically to solve these HPC problems.


2. Multi-Head Latent Attention (MLA)

Traditional Llama models use GQA (Grouped Query Attention) to reduce KV Cache, but it's not enough. DeepSeek introduces MLA, which uses Low-Rank Compression to squeeze KV Cache to the limit.

2.1 Core Principle

MLA introduces a Compressed KV Vector. Instead of storing the huge Key and Value matrices directly, it stores a low-dimensional latent vector $c_{KV}$.

$$c_{KV} = W_{DKV} \cdot h$$

During inference, Key and Value are recovered via up-projection matrices $W_{UK}$ and $W_{UV}$. Since these projection matrices can be absorbed into the Query computation (by associativity of matrix multiplication), **we effectively only need to cache the tiny compressed vector $c_{KV}$**.
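
Below is a minimal PyTorch sketch of the compression path only. The dimensions (`d_model=5120`, `d_c=512`, 32 heads of 128) are illustrative rather than DeepSeek's exact configuration, and it omits the decoupled RoPE handling and query compression used in the full MLA design.

```python
import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    """Sketch of MLA's KV compression path (dimensions are illustrative)."""
    def __init__(self, d_model=5120, d_c=512, n_heads=32, d_head=128):
        super().__init__()
        self.W_DKV = nn.Linear(d_model, d_c, bias=False)           # down-projection
        self.W_UK = nn.Linear(d_c, n_heads * d_head, bias=False)   # up-projection for K
        self.W_UV = nn.Linear(d_c, n_heads * d_head, bias=False)   # up-projection for V

    def forward(self, h):
        # Only c_kv goes into the KV cache; K and V are recovered on the fly
        # (or W_UK is folded into the query projection during inference).
        c_kv = self.W_DKV(h)      # [batch, seq, d_c]
        k = self.W_UK(c_kv)       # [batch, seq, n_heads * d_head]
        v = self.W_UV(c_kv)
        return c_kv, k, v

h = torch.randn(1, 16, 5120)
c_kv, k, v = CompressedKV()(h)
print(c_kv.shape, k.shape)        # torch.Size([1, 16, 512]) torch.Size([1, 16, 4096])
```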

2.2 Benefits

  • Memory Saving: Reduces KV Cache memory usage by 93% compared to standard MHA.
  • Throughput Boost: Smaller KV Cache means larger Batch Sizes can be supported, significantly increasing inference throughput.
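
To make the ~93% figure concrete, here is a back-of-the-envelope comparison. The head count, head dimension, and latent dimension are assumed for illustration, and the small decoupled RoPE key that MLA also caches is ignored.

```python
# KV cache per token per layer in fp16 (2 bytes per element), assumed dims.
n_heads, d_head, d_c = 32, 128, 512
mha_bytes = 2 * n_heads * d_head * 2      # full K and V for every head
mla_bytes = d_c * 2                        # only the compressed latent c_KV
print(mha_bytes, mla_bytes)                # 16384 1024
print(1 - mla_bytes / mha_bytes)           # 0.9375 -> roughly the 93% figure
```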

3. DeepSeekMoE: Fine-Grained & Shared Experts

Traditional MoE (like Mixtral 8x7B) typically uses "Top-2 Routing", selecting 2 out of 8 large experts. With so few, coarse-grained experts, each expert is forced to cover overly diverse knowledge, and routing is prone to collapsing onto a handful of experts.

DeepSeek introduces two improvements:

3.1 Fine-Grained Experts

Splitting one "Large Expert" into many "Small Experts".

  • Traditional: 8 experts, FF dim 4096 each.
  • DeepSeek: 64 experts, FF dim 512 each.
  • Routing: Select Top-8 instead of Top-2.

Advantage: More flexible expert combinations, allowing precise fitting of different knowledge domains.
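
One way to see this flexibility: with the illustrative counts above and a similar total activated FFN width, the number of possible expert combinations per token explodes.

```python
from math import comb

# More ways to combine experts per token (counts are the illustrative ones above).
print(comb(8, 2))    # 28 combinations: traditional Top-2 out of 8
print(comb(64, 8))   # 4,426,165,368 combinations: Top-8 out of 64
```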

3.2 Shared Experts

Regardless of the input, a few fixed experts (Shared Experts) are always activated.

  • Purpose: To capture general knowledge (syntax, logic), letting Routed Experts focus on specialized domain knowledge.
  • Effect: Acts like embedding a Dense model within MoE, stabilizing the training process.
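
A toy PyTorch sketch of the shared + routed structure follows. The expert counts, hidden sizes, and the naive per-token dispatch loop are chosen for clarity only; real implementations use fused, batched expert kernels.

```python
import torch
import torch.nn as nn

class DeepSeekMoEBlock(nn.Module):
    """Toy shared + routed expert layer (sizes and dispatch are illustrative)."""
    def __init__(self, d_model=256, d_ff=64, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))   # always active
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))   # selected per token
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                        # x: [tokens, d_model]
        out = sum(e(x) for e in self.shared)     # shared experts capture general knowledge
        gates = torch.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        for t in range(x.size(0)):               # naive per-token loop for clarity
            for w, i in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out

x = torch.randn(8, 256)
print(DeepSeekMoEBlock()(x).shape)   # torch.Size([8, 256])
```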

4. Load Balancing without Aux Loss

In MoE training, the main risk is that most tokens route to the same few experts (load imbalance). Traditional methods add an Auxiliary Loss to penalize imbalance.

However, this interferes with the main task learning. DeepSeek adopts an Auxiliary-Loss-Free strategy, dynamically balancing load purely by adjusting the router's Bias term.

$$g_i(x) = \text{Softmax}(u_i + b_i)$$

If an expert is overloaded, its bias $b_i$ is dynamically lowered to reduce its selection probability. This ensures HPC efficiency (balanced GPU load) without hurting model performance.
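
Here is a toy sketch of one balancing step, following the formula above: lower the bias of overloaded experts, raise it for underloaded ones. The `bias_update_speed` value and the sign-based update rule are illustrative assumptions, not the published hyperparameters.

```python
import torch

n_experts, top_k, bias_update_speed = 16, 4, 0.001
bias = torch.zeros(n_experts)

affinity = torch.randn(1024, n_experts)                  # u_i for each token
gates = torch.softmax(affinity + bias, dim=-1)           # g_i(x) = Softmax(u_i + b_i)
chosen = gates.topk(top_k, dim=-1).indices               # experts each token is routed to

load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
bias = bias - bias_update_speed * torch.sign(load - load.mean())
print(load.int().tolist())   # tokens per expert this step
print(bias)                  # overloaded experts now have a lower bias
```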

5. Summary

DeepSeek's success is a victory of co-design between HPC and algorithm engineers.

  • MLA solves the inference memory bottleneck.
  • DeepSeekMoE solves the contradiction between training convergence and knowledge expression.
  • Bias-only Balancing solves cluster load balancing issues.

Understanding DeepSeek architecture is crucial for designing next-generation large model infrastructure.

AI-HPC Organization