Deep Dive into vLLM: PagedAttention & High-Performance Inference

Abstract: In LLM inference, memory (both bandwidth and capacity), rather than compute, is often the decisive bottleneck. vLLM addresses KV Cache fragmentation by borrowing the concept of virtual memory paging from operating systems (PagedAttention), boosting inference throughput by 2-4x.

1. The Bottleneck: Memory Fragmentation & KV Cache

During auto-regressive generation, an LLM produces tokens one at a time. To avoid recomputing attention over the entire prefix at every step, the Key and Value projections of past tokens are cached; this is the KV Cache.
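To make this concrete, here is a toy single-head decode step in plain NumPy (the dimensions, weights, and function names are made up for illustration; real engines keep the cache in GPU tensors). At each step only the newest token's Key/Value vectors are computed, yet attention still runs over every cached position.

```python
import numpy as np

# Toy single-head attention decode step with a KV cache (illustrative only).
d = 8                                   # hidden size of our toy model
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []               # grow by one entry per generated token

def decode_step(x):
    """Attend the newest token's query against all cached keys/values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)                   # cache K/V instead of recomputing them
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = np.exp(q @ K.T / np.sqrt(d))
    return (scores / scores.sum()) @ V  # attention output for the new token

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(len(k_cache), out.shape)          # cache now holds 5 (K, V) pairs; output shape (8,)
```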

The Root of Waste

Traditional inference systems (such as Hugging Face Transformers) typically require each sequence's KV Cache to occupy a single contiguous region of memory. This leads to two kinds of waste:

  1. Over-allocation: Memory must be reserved for the maximum sequence length (e.g., 2048 tokens) to prevent overflow, even if a request ends up using only 100 of them (internal fragmentation).
  2. Fragmentation: Even when free memory exists, a new request cannot use it unless it forms a sufficiently large contiguous block (external fragmentation).

Measurements reported in the vLLM paper show that such systems waste roughly 60% - 80% of their KV Cache memory to this internal and external fragmentation.
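A back-of-the-envelope calculation shows the scale of the problem. The dimensions below are roughly those of a 7B-parameter LLaMA-style model in fp16 and are assumptions for illustration:

```python
# Back-of-the-envelope KV Cache sizing (illustrative numbers for a 7B-class model).
num_layers   = 32      # transformer layers
num_kv_heads = 32      # attention heads that store K/V
head_dim     = 128     # dimension per head
dtype_bytes  = 2       # fp16

# Each token stores one Key and one Value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV Cache per token: {bytes_per_token / 1024:.0f} KiB")          # ~512 KiB

# Pre-allocating for max_len = 2048 when the request only ever uses 100 tokens:
reserved_mib = 2048 * bytes_per_token / 2**20
used_mib     = 100  * bytes_per_token / 2**20
print(f"reserved {reserved_mib:.0f} MiB, used {used_mib:.0f} MiB "
      f"({used_mib / reserved_mib:.1%} utilized)")                      # ~4.9% utilized
```

Of the roughly 1 GiB reserved for that single request, about 95% sits idle, yet it cannot be handed to another request because the allocation must stay contiguous.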


2. The Solution: PagedAttention

vLLM's core innovation is introducing the concepts of Virtual Memory and Paging from operating systems.

2.1 Algorithm Principle

PagedAttention allows a sequence's KV Cache to be stored in non-contiguous physical memory.

  • Logical Block: Similar to an OS Virtual Page.
  • Physical Block: Similar to an OS Physical Page Frame.
  • Block Table: Records the mapping from logical blocks to physical blocks.
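A minimal sketch of the block-table idea (hypothetical code, not vLLM's internal data structures): each sequence's logical blocks map to arbitrary physical blocks, which are taken from a free list only when the sequence actually grows into them.

```python
# Minimal sketch of a block table (hypothetical code, not vLLM's data structures).
BLOCK_SIZE = 16  # tokens per block (16 is vLLM's default block size)

class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # pool of free physical block ids
        self.table = {}                                      # seq_id -> list of physical block ids

    def grow_to(self, seq_id, num_tokens):
        """Allocate just enough physical blocks to hold num_tokens tokens."""
        blocks = self.table.setdefault(seq_id, [])
        needed = -(-num_tokens // BLOCK_SIZE)                # ceil division: logical blocks required
        while len(blocks) < needed:
            blocks.append(self.free_blocks.pop())            # any free block will do; no contiguity needed

    def physical_slot(self, seq_id, token_idx):
        """Translate a logical token position into (physical block id, offset)."""
        block = self.table[seq_id][token_idx // BLOCK_SIZE]
        return block, token_idx % BLOCK_SIZE

bt = BlockTable(num_physical_blocks=1024)
bt.grow_to(seq_id=0, num_tokens=100)            # only ceil(100 / 16) = 7 blocks are allocated
print(bt.physical_slot(seq_id=0, token_idx=99)) # -> (some physical block id, 3)
```

Because any free physical block will do, no large contiguous region is ever required, which eliminates external fragmentation.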

2.2 Advantages

  1. Near-Zero Waste: Physical blocks are allocated only when data actually needs to be stored, so waste is confined to the partially filled last block of each sequence.
  2. Flexible Sharing: Similar to OS Copy-on-Write, different sequences (e.g., Parallel Sampling or Beam Search over the same prompt) can share the physical blocks holding the prompt's KV Cache, substantially reducing memory usage (see the sketch below).
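The sharing mechanism can be sketched with simple reference counting (hypothetical helper functions, purely illustrative): forked sequences reuse the prompt's physical blocks, and a block is copied only when a sequence that does not own it exclusively needs to write to it.

```python
# Minimal copy-on-write sketch (hypothetical helpers, not vLLM's API).
ref_count = {}                           # physical block id -> number of sequences using it
next_block = 0

def alloc_block():
    """Allocate a fresh physical block with a reference count of 1."""
    global next_block
    blk = next_block
    next_block += 1
    ref_count[blk] = 1
    return blk

def fork(block_ids):
    """Fork a sequence: reuse the same physical blocks, just bump their ref counts."""
    for blk in block_ids:
        ref_count[blk] += 1
    return list(block_ids)

def write_block(block_ids, i):
    """Before writing, copy the block if another sequence still references it."""
    blk = block_ids[i]
    if ref_count[blk] > 1:               # shared: make a private copy (KV data would be copied here)
        ref_count[blk] -= 1
        block_ids[i] = alloc_block()

prompt_blocks = [alloc_block() for _ in range(4)]      # the shared prompt occupies 4 blocks
seq_a, seq_b = prompt_blocks, fork(prompt_blocks)      # two parallel samples share them
write_block(seq_b, 3)                                  # seq_b writes -> its last block is copied
print(seq_a, seq_b, ref_count)                         # seq_a and seq_b still share blocks 0-2
```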

3. Another Weapon: Continuous Batching

Beyond memory management, vLLM also implements Continuous Batching (Iteration-level Scheduling).

  • Traditional Static Batching: The server waits for every request in the current batch to finish before admitting the next batch, so short requests are blocked behind long ones (Head-of-line blocking).
  • Continuous Batching: Scheduling happens at the granularity of a single decoding iteration; as soon as a request finishes generation (hits EOS), its slot is immediately released to a waiting request, without waiting for the rest of the batch.

This keeps GPU utilization consistently high.
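The scheduling idea can be sketched with a toy loop (this is not vLLM's scheduler; request completion is simulated with random numbers): the batch is re-formed after every decoding iteration, so a freed slot is reused immediately.

```python
import random
from collections import deque

random.seed(0)
waiting = deque(f"req{i}" for i in range(8))   # queued requests
running = []                                   # requests currently in the batch
MAX_BATCH = 4
step = 0

while waiting or running:
    # Admit waiting requests into any free slots -- no waiting for the whole batch to drain.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One decoding iteration for every running request (stand-in for a model forward pass).
    finished = [r for r in running if random.random() < 0.3]   # pretend some requests hit EOS

    # Evict finished requests immediately; their slots are filled at the next iteration.
    running = [r for r in running if r not in finished]
    step += 1
    print(f"step {step}: running={running} finished={finished}")
```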


4. Hands-on: Installation & Usage

4.1 Prerequisites

  • OS: Linux
  • Python: 3.8+
  • CUDA: 11.8 or 12.1 (Recommended)
```bash
# Install via pip
pip install vllm

# Verify installation
python -c "import vllm; print(vllm.__version__)"
```

4.2 Offline Inference

Suitable for batch processing tasks, such as scoring datasets.

```python
from vllm import LLM, SamplingParams

# 1. Initialize the engine.
# vLLM pre-allocates most of the free GPU memory for the KV cache;
# tune gpu_memory_utilization if you need to leave room for other processes.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.9)

# 2. Define sampling parameters.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# 3. Prepare inputs.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# 4. Generate.
outputs = llm.generate(prompts, sampling_params)

# 5. Print results.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

4.3 OpenAI Compatible Server (Online Serving)

vLLM ships with a high-performance API server compatible with the OpenAI API protocol.

Start Server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --trust-remote-code \
    --port 8000
```

Client Call (using openai lib):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require an API key by default
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain PagedAttention."}
    ]
)

print(completion.choices[0].message.content)
```

5. Benchmark Performance

On an NVIDIA A100 (80 GB), vLLM typically achieves a 2x - 4x throughput improvement over Hugging Face Transformers.

| Framework | Throughput (req/s) | Latency (ms) |
| --- | --- | --- |
| HF Transformers | 2.5 | 450 |
| HF Text Generation Inference (TGI) | 5.8 | 120 |
| vLLM | 12.4 | 110 |

(Figures are for reference only; actual performance depends on model size and batch size.)

6. Summary

vLLM is more than an inference library; it demonstrates how classic systems techniques (virtual memory paging, iteration-level scheduling) can be applied to AI workloads to remove real-world bottlenecks. Understanding vLLM is essential for engineers building large-scale AI production systems.

AI-HPC Organization