
RTP (Retention + Transformer) Hybrid Architecture

In 2024-2025, the industry shifted towards a new paradigm: the RTP Hybrid (Retention-Transformer Pipeline Hybrid). By combining Retention (linear attention) layers and Transformer (standard attention) layers in a deep pipeline, this architecture significantly reduces KV Cache memory usage during inference and improves long-sequence training efficiency.

Representative Models: DeepSeek V3 / R1, Qwen2-VL, Jamba.

1. The Bottleneck: Transformer's KV Cache Wall

For a sequence of length $L$ (batch size 1), the KV Cache memory usage in standard Transformer inference is:

$$
\text{KV}_{\text{size}} = 2 \cdot L \cdot n_{\text{layer}} \cdot d_{\text{kv}} \cdot b_{\text{dtype}}
$$

where the factor 2 covers keys and values, $d_{\text{kv}} = n_{\text{kv\_heads}} \cdot d_{\text{head}}$, and $b_{\text{dtype}}$ is the bytes per element (2 for fp16/bf16).

For a 70B-class model with a 128k context, the KV Cache alone can consume 100GB+ of VRAM (see the sizing sketch after this list), causing:

  1. OOM: Long contexts simply do not fit on a single GPU.
  2. Bandwidth Bound: Decode latency is dominated by loading the cache from memory, not by computation.
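
A quick sanity check of the formula above in Python; the 70B-class dimensions (80 layers, 64 heads of size 128, fp16) and the GQA group size are illustrative assumptions, not any specific model's configuration:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2, batch=1):
    """Per-request KV Cache size: keys and values for every layer and token."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class config: 80 layers, 128k context, fp16.
mha = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"MHA: {mha / 2**30:.0f} GiB, GQA (8 KV heads): {gqa / 2**30:.0f} GiB")
# MHA: 320 GiB, GQA (8 KV heads): 40 GiB
```

Even with GQA the cache grows linearly with context length, which is the wall the RTP design attacks.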

2. The RTP Solution

The core idea of RTP is "Structural Division of Labor":

```text
Input → [Retention Layers] → [Transformer Layers] → Output
         (Front: 50%-80%)      (Back: 20%-50%)
```

2.1 Retention Layers (Front)

  • Mechanism: SSM (State Space Models) or Linear Attention.
  • Feature: No KV Cache during inference; only a tiny fixed-size State is maintained (see the sketch after this list).
  • Role: Compresses massive context efficiently.
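
A minimal single-head sketch of the recurrent (decode-time) form of retention / linear attention; the shapes, the function name, and the `decay` default are illustrative assumptions:

```python
import torch

def retention_step(state, q, k, v, decay=0.97):
    """One decode step: a fixed-size state replaces a growing KV Cache.

    state: [d_k, d_v] running summary of all past tokens
    q, k:  [d_k] query / key for the current token
    v:     [d_v] value for the current token
    """
    state = decay * state + torch.outer(k, v)  # constant-size state update
    out = q @ state                            # read-out for the current token
    return out, state
```

Per head, memory stays at O(d_k * d_v) no matter how long the sequence gets, which is what lets the front of the pipeline compress massive context cheaply.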

2.2 Transformer Layers (Back)

  • Mechanism: Standard Multi-Head Attention (MHA/GQA).
  • Feature: Keeps a KV Cache that grows with sequence length (contrast the decode sketch after this list); retains the strong reasoning capability of full attention.
  • Role: Performs deep reasoning based on features extracted by Retention.
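
For contrast, the same decode step with standard attention, where the cache gains one entry per token; again a single-head illustrative sketch:

```python
import torch

def attention_step(kv_cache, q, k, v):
    """One decode step of standard attention: the cache grows by one K/V pair per token.

    kv_cache: {"k": [list of [d_k]], "v": [list of [d_v]]} (grows with sequence length)
    """
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    K = torch.stack(kv_cache["k"])                            # [t, d_k]
    V = torch.stack(kv_cache["v"])                            # [t, d_v]
    attn = torch.softmax(K @ q / K.shape[-1] ** 0.5, dim=0)   # [t]
    return attn @ V, kv_cache
```

The O(L) cache is the price paid for exact token-to-token attention in the back of the pipeline.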

3. Performance Comparison

| Metric | Pure Transformer | RTP Hybrid (50/50) | Benefit |
| --- | --- | --- | --- |
| Inference KV Cache | 100% (baseline) | ~50% | Half the memory, roughly 2x batch size |
| TTFT (prefill) | High: O(L^2) in every layer | Reduced: O(L) in Retention layers, O(L^2) only in the Transformer half | Faster long-context processing |
| Training throughput | Baseline | 1.2x - 1.5x | Retention layers are computationally lighter |

4. System-Level Optimization (HPC View)

4.1 Pipeline Parallelism

Retention layers have a lower computational density than attention layers. When splitting the model for pipeline parallelism (PP), more Retention layers are therefore packed into the same stage, or they serve as a "buffer zone", so that per-stage compute stays balanced and pipeline bubbles shrink. A minimal cost-balancing sketch follows.
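
A greedy contiguous partition over estimated per-layer costs; the cost values (0.5 for Retention, 1.0 for attention) and the function name are illustrative assumptions, not a production scheduler:

```python
def split_stages(layer_costs, n_stages):
    """Greedily pack contiguous layers into pipeline stages so that the
    estimated per-stage cost stays close to the overall average."""
    target = sum(layer_costs) / n_stages
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        # Close the current stage once it reaches the target cost,
        # as long as we still have stages left to open.
        if current and acc + cost > target and len(stages) < n_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
        current.append(i)
        acc += cost
    stages.append(current)
    return stages

# 8 cheap front Retention layers + 8 back Transformer layers, 4 stages.
costs = [0.5] * 8 + [1.0] * 8
print(split_stages(costs, 4))
# -> [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12], [13, 14, 15]]
```

Note how the first stage absorbs six cheap Retention layers while the attention-heavy stages hold fewer layers, keeping per-stage work roughly equal.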

4.2 Inference Engine Adaptation (vLLM)

Adapting RTP models in vLLM requires modifying the scheduler and its cache bookkeeping (a conceptual sketch follows this list):

  • Physical Block Allocation: Allocate KV Cache Blocks only for Transformer layers.
  • State Management: Allocate a small contiguous buffer per sequence to hold the Retention layers' recurrent state.
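
A conceptual sketch of that per-sequence bookkeeping; the class names, `BLOCK_TOKENS`, and the slot scheme are invented for illustration and are not vLLM's actual data structures:

```python
from dataclasses import dataclass, field

BLOCK_TOKENS = 16  # tokens per paged KV block (illustrative value)

@dataclass
class SequenceCache:
    kv_block_ids: list = field(default_factory=list)  # paged KV blocks, Transformer layers only
    state_slot: int = -1                              # fixed-size slot for all Retention states

class HybridCacheManager:
    """Per-sequence cache bookkeeping for a hybrid model (conceptual, not vLLM code)."""

    def __init__(self, num_kv_blocks, num_state_slots):
        self.free_blocks = list(range(num_kv_blocks))
        self.free_slots = list(range(num_state_slots))

    def allocate(self, seq, total_tokens):
        # Retention state: constant size, allocated once when the sequence starts.
        if seq.state_slot < 0:
            seq.state_slot = self.free_slots.pop()
        # KV Cache: grows with sequence length, but only for the Transformer layers.
        blocks_needed = -(-total_tokens // BLOCK_TOKENS) - len(seq.kv_block_ids)
        for _ in range(max(0, blocks_needed)):
            seq.kv_block_ids.append(self.free_blocks.pop())

    def free(self, seq):
        # Return every resource when the request finishes.
        self.free_blocks.extend(seq.kv_block_ids)
        if seq.state_slot >= 0:
            self.free_slots.append(seq.state_slot)
        seq.kv_block_ids, seq.state_slot = [], -1
```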

5. Pseudocode

```python
import torch.nn as nn

# Retention, MultiHeadAttention and FeedForward stand in for real
# implementations; this block is pseudocode for the layer structure.
class RTPBlock(nn.Module):
    def __init__(self, dim, layer_type="retention"):
        super().__init__()
        if layer_type == "retention":
            # Linear complexity, no KV Cache (fixed-size recurrent state only)
            self.mixer = Retention(dim)
        else:
            # Quadratic complexity, requires a KV Cache
            self.mixer = MultiHeadAttention(dim)
        self.ffn = FeedForward(dim)

    def forward(self, x, cache=None):
        # Residual token mixing (retention or attention), then residual FFN
        x = x + self.mixer(x, cache)
        x = x + self.ffn(x)
        return x
```
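
As a usage sketch, the blocks above can be stacked into the front-Retention / back-Transformer layout from Section 2; `build_rtp_stack` and its 75/25 split are illustrative choices, not a specific model's recipe:

```python
import torch.nn as nn

def build_rtp_stack(dim, n_layers, retention_ratio=0.75):
    """Front `retention_ratio` of the depth uses Retention blocks (no KV Cache);
    the remaining back layers use attention blocks (with KV Cache)."""
    n_retention = int(n_layers * retention_ratio)
    return nn.ModuleList(
        RTPBlock(dim, layer_type="retention" if i < n_retention else "attention")
        for i in range(n_layers)
    )
```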
