Deep Dive into vLLM: PagedAttention & High-Performance Inference

Abstract: In LLM inference, memory (both bandwidth and capacity), rather than compute, is often the decisive bottleneck. vLLM addresses KV Cache fragmentation by borrowing the concept of virtual memory paging from operating systems (PagedAttention), boosting inference throughput by 2-4x.

1. The Bottleneck: Memory Fragmentation & KV Cache

During auto-regressive generation, an LLM produces tokens one at a time. To avoid recomputing attention over the entire prefix at every step, the Key and Value projections of past tokens are cached; this is the KV Cache.
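To make this concrete, here is a toy single-head decode step in plain NumPy (the dimensions, weights, and function names are made up for illustration; real engines keep the cache in GPU tensors). At each step only the newest token's Key/Value vectors are computed, yet attention still runs over every cached position.

```python
import numpy as np

# Toy single-head attention decode step with a KV cache (illustrative only).
d = 8                                   # hidden size of our toy model
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []               # grow by one entry per generated token

def decode_step(x):
    """Attend the newest token's query against all cached keys/values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                   # cache K/V instead of recomputing them
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = np.exp(q @ K.T / np.sqrt(d))
    return (scores / scores.sum()) @ V  # attention output for the new token

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(len(k_cache), out.shape)          # cache now holds 5 (K, V) pairs; output shape (8,)
```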

The Root of Waste

Traditional inference systems (such as Hugging Face Transformers) typically require each sequence's KV Cache to occupy a single contiguous region of memory. This leads to two kinds of waste:

  1. Over-allocation: Memory must be reserved for the maximum sequence length (e.g., 2048 tokens) to prevent overflow, even if a request ends up using only 100 of them (internal fragmentation).
  2. Fragmentation: Even when free memory exists, a new request cannot use it unless it forms a sufficiently large contiguous block (external fragmentation).

Measurements reported in the vLLM paper show that such systems waste roughly 60% - 80% of their KV Cache memory to this internal and external fragmentation.
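A back-of-the-envelope calculation shows the scale of the problem. The dimensions below are roughly those of a 7B-parameter LLaMA-style model in fp16 and are assumptions for illustration:

```python
# Back-of-the-envelope KV Cache sizing (illustrative numbers for a 7B-class model).
num_layers   = 32      # transformer layers
num_kv_heads = 32      # attention heads that store K/V
head_dim     = 128     # dimension per head
dtype_bytes  = 2       # fp16

# Each token stores one Key and one Value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV Cache per token: {bytes_per_token / 1024:.0f} KiB")          # ~512 KiB

# Pre-allocating for max_len = 2048 when the request only ever uses 100 tokens:
reserved_mib = 2048 * bytes_per_token / 2**20
used_mib     = 100  * bytes_per_token / 2**20
print(f"reserved {reserved_mib:.0f} MiB, used {used_mib:.0f} MiB "
      f"({used_mib / reserved_mib:.1%} utilized)")                      # ~4.9% utilized
```

Of the roughly 1 GiB reserved for that single request, about 95% sits idle, yet it cannot be handed to another request because the allocation must stay contiguous.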


2. The Solution: PagedAttention

vLLM's core innovation is introducing the concepts of Virtual Memory and Paging from operating systems.

2.1 Algorithm Principle

PagedAttention allows a sequence's KV Cache to be stored in non-contiguous physical memory.

  • Logical Block: Similar to an OS Virtual Page.
  • Physical Block: Similar to an OS Physical Page Frame.
  • Block Table: Records the mapping from logical blocks to physical blocks.
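A minimal sketch of the block-table idea (hypothetical code, not vLLM's internal data structures): each sequence's logical blocks map to arbitrary physical blocks, which are taken from a free list only when the sequence actually grows into them.

```python
# Minimal sketch of a block table (hypothetical code, not vLLM's data structures).
BLOCK_SIZE = 16  # tokens per block (16 is vLLM's default block size)

class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # pool of free physical block ids
        self.table = {}                                      # seq_id -> list of physical block ids

    def grow_to(self, seq_id, num_tokens):
        """Allocate just enough physical blocks to hold num_tokens tokens."""
        blocks = self.table.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)                # ceil division: logical blocks required
        while len(blocks) < needed:
            blocks.append(self.free_blocks.pop())            # any free block will do; no contiguity needed

    def physical_slot(self, seq_id, token_idx):
        """Translate a logical token position into (physical block id, offset)."""
        block = self.table[seq_id][token_idx // BLOCK_SIZE]
        return block, token_idx % BLOCK_SIZE

bt = BlockTable(num_physical_blocks=1024)
bt.grow_to(seq_id=0, num_tokens=100)            # only ceil(100 / 16) = 7 blocks are allocated
print(bt.physical_slot(seq_id=0, token_idx=99)) # -> (some physical block id, 3)
```

Because any free physical block will do, no large contiguous region is ever required, which eliminates external fragmentation.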

2.2 Advantages

  1. Near-Zero Waste: Physical blocks are allocated only when data actually needs to be stored, so waste is confined to the partially filled last block of each sequence.
  2. Flexible Sharing: Similar to OS Copy-on-Write, different sequences (e.g., Parallel Sampling or Beam Search over the same prompt) can share the physical blocks holding the prompt's KV Cache, substantially reducing memory usage (see the sketch below).
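The sharing mechanism can be sketched with simple reference counting (hypothetical helper functions, purely illustrative): forked sequences reuse the prompt's physical blocks, and a block is copied only when a sequence that does not own it exclusively needs to write to it.

```python
# Minimal copy-on-write sketch (hypothetical helpers, not vLLM's API).
ref_count = {}                           # physical block id -> number of sequences using it
next_block = 0

def alloc_block():
    """Allocate a fresh physical block with a reference count of 1."""
    global next_block
    blk = next_block
    next_block += 1
    ref_count[blk] = 1
    return blk

def fork(block_ids):
    """Fork a sequence: reuse the same physical blocks, just bump their ref counts."""
    for blk in block_ids:
        ref_count[blk] += 1
    return list(block_ids)

def write_block(block_ids, i):
    """Before writing, copy the block if another sequence still references it."""
    blk = block_ids[i]
    if ref_count[blk] > 1:               # shared: make a private copy (KV data would be copied here)
        ref_count[blk] -= 1
        block_ids[i] = alloc_block()

prompt_blocks = [alloc_block() for _ in range(4)]      # the shared prompt occupies 4 blocks
seq_a, seq_b = prompt_blocks, fork(prompt_blocks)      # two parallel samples share them
write_block(seq_b, 3)                                  # seq_b writes -> its last block is copied
print(seq_a, seq_b, ref_count)                         # seq_a and seq_b still share blocks 0-2
```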

3. Another Weapon: Continuous Batching

Beyond memory management, vLLM also implements Continuous Batching (Iteration-level Scheduling).

  • Traditional Static Batching: The server waits for every request in the current batch to finish before admitting the next batch, so short requests are blocked behind long ones (Head-of-line blocking).
  • Continuous Batching: Scheduling happens at the granularity of a single decoding iteration; as soon as a request finishes generation (hits EOS), its slot is immediately released to a waiting request, without waiting for the rest of the batch.

This keeps GPU utilization consistently high.
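The scheduling idea can be sketched with a toy loop (this is not vLLM's scheduler; request completion is simulated with random numbers): the batch is re-formed after every decoding iteration, so a freed slot is reused immediately.

```python
import random
from collections import deque

random.seed(0)
waiting = deque(f"req{i}" for i in range(8))   # queued requests
running = []                                   # requests currently in the batch
MAX_BATCH = 4
step = 0

while waiting or running:
    # Admit waiting requests into any free slots -- no waiting for the whole batch to drain.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One decoding iteration for every running request (stand-in for a model forward pass).
    finished = [r for r in running if random.random() < 0.3]   # pretend some requests hit EOS

    # Evict finished requests immediately; their slots are filled at the next iteration.
    running = [r for r in running if r not in finished]
    step += 1
    print(f"step {step}: running={running} finished={finished}")
```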


4. Hands-on: Installation & Usage

4.1 Prerequisites

  • OS: Linux
  • Python: 3.8+
  • CUDA: 11.8 or 12.1 (Recommended)
```bash
# Install via pip
pip install vllm

# Verify installation
python -c "import vllm; print(vllm.__version__)"
```

4.2 Offline Inference

Suitable for batch processing tasks, such as scoring datasets.

```python
from vllm import LLM, SamplingParams

# 1. Initialize the engine.
# vLLM pre-allocates most of the free GPU memory for the KV cache;
# tune gpu_memory_utilization if you need to leave room for other processes.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)

# 2. Define sampling parameters.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# 3. Prepare inputs.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# 4. Generate.
outputs = llm.generate(prompts, sampling_params)

# 5. Print results.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

4.3 OpenAI Compatible Server (Online Serving)

vLLM ships with a high-performance API server compatible with the OpenAI API protocol.

Start Server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trust-remote-code \
    --port 8000
```

Client Call (using openai lib):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require an API key by default
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain PagedAttention."}
    ]
)

print(completion.choices[0].message.content)
```

5. Benchmark Performance

On an NVIDIA A100 (80 GB), vLLM typically achieves a 2x - 4x throughput improvement over Hugging Face Transformers.

| Framework | Throughput (req/s) | Latency (ms) |
| --- | --- | --- |
| HF Transformers | 2.5 | 450 |
| HF Text Generation Inference (TGI) | 5.8 | 120 |
| vLLM | 12.4 | 110 |

(Figures are for reference only; actual performance depends on model size and batch size.)

6. Summary

vLLM is more than an inference library; it demonstrates how classic systems techniques (virtual memory paging, iteration-level scheduling) can be applied to AI workloads to remove real-world bottlenecks. Understanding vLLM is essential for engineers building large-scale AI production systems.

AI-HPC Organization