
LLM Inference Engine: vLLM Architecture & Practice

In production deployments of Large Language Models (LLMs), inference systems face the dual challenge of high Throughput and low Latency. vLLM, built around a novel approach to KV-cache memory management, has become a de facto standard for open-source LLM serving.

1. Core Technology Analysis

1.1 PagedAttention: Solving Memory Fragmentation

Traditional inference engines (such as the stock HuggingFace Transformers generation loop) pre-allocate a contiguous KV-cache buffer sized for each request's maximum possible length, and fragmentation plus over-reservation can waste 60%-80% of that memory.

vLLM introduces the concept of Virtual Memory from operating systems:

  • KV Cache Paging: Splits the KV Cache into fixed-size blocks (e.g., each block stores 16 tokens).
  • Non-Contiguous Storage: Tensors that are logically contiguous can be physically non-contiguous in VRAM.
  • Dynamic Mapping: A block table (analogous to an OS page table) records the mapping between logical blocks and physical blocks, enabling on-demand allocation (see the sketch below).
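
To make the mapping concrete, here is a toy Python sketch of the block-table idea. It is deliberately simplified; the names BLOCK_SIZE, block_table, and append_token are illustrative and are not vLLM internals.

python
# Minimal illustration of PagedAttention-style block mapping (not vLLM internals).
BLOCK_SIZE = 16          # tokens stored per KV-cache block
NUM_PHYSICAL_BLOCKS = 8  # size of the (toy) physical block pool

free_blocks = list(range(NUM_PHYSICAL_BLOCKS))  # free physical block IDs
block_table = {}  # sequence ID -> list of physical block IDs (the "page table")

def append_token(seq_id: int, num_tokens_so_far: int) -> None:
    """Allocate a new physical block only when the current one is full."""
    if num_tokens_so_far % BLOCK_SIZE == 0:        # crossing a block boundary
        physical_block = free_blocks.pop(0)        # on-demand allocation
        block_table.setdefault(seq_id, []).append(physical_block)

def free_sequence(seq_id: int) -> None:
    """Return all blocks of a finished sequence to the pool."""
    free_blocks.extend(block_table.pop(seq_id, []))

# Example: sequence 0 generates 20 tokens -> needs 2 blocks (16 + 4 tokens).
for t in range(20):
    append_token(seq_id=0, num_tokens_so_far=t)
print(block_table)   # logical blocks of seq 0 map to whatever physical IDs were free
free_sequence(0)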

1.2 Continuous Batching: Extreme Scheduling

Traditional batching (Static Batching) must wait for all requests in a batch to finish before starting the next batch, leading to the "short jobs waiting for long jobs" problem.

vLLM implements Continuous Batching (a technique introduced by Orca):

  • Iteration-Level Scheduling: As soon as a request finishes generating (emits [EOS] or hits its token limit), its slot is released and a new request from the waiting queue is admitted at the next decode iteration.
  • Zero Bubbles: GPU compute units stay busy across iterations instead of idling until the slowest request in a batch finishes (see the scheduling sketch below).
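
Below is a rough sketch of iteration-level scheduling under simplified assumptions (a random stand-in for the decode step, no preemption or KV-block accounting). It illustrates the scheduling loop, not vLLM's actual scheduler.

python
# Toy iteration-level scheduler (not vLLM internals): requests are admitted and
# retired after every decode iteration instead of once per static batch.
import random
from collections import deque

MAX_NUM_SEQS = 4                                   # analogous to --max-num-seqs
waiting = deque(f"req-{i}" for i in range(10))     # hypothetical request queue
running = []

def decode_one_step(request: str) -> bool:
    """Stand-in for one decode forward pass; True means the request emitted [EOS]."""
    return random.random() < 0.3

while waiting or running:
    # 1. Fill any free slots from the waiting queue (iteration-level scheduling).
    while waiting and len(running) < MAX_NUM_SEQS:
        running.append(waiting.popleft())

    # 2. Run one decode iteration for the whole running batch.
    finished = [r for r in running if decode_one_step(r)]

    # 3. Retire finished requests immediately, freeing their slots for new ones.
    for r in finished:
        running.remove(r)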

2. Quick Start

2.1 Environment Preparation

It is recommended to use Docker containers to avoid CUDA version conflicts.

bash
# Start an interactive shell inside the official image
# (its default entrypoint launches the API server, so override it)
docker run --gpus all -it --rm \
    --entrypoint /bin/bash \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest

2.2 Offline Inference (Python SDK)

Suitable for offline batch processing tasks (e.g., data cleaning, RAG vector database construction).

python
from vllm import LLM, SamplingParams

# 1. Load the model (tensor_parallel_size=1 keeps it on a single GPU; increase to shard across GPUs)
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=1)

# 2. Set sampling parameters (Temperature=0.8, Top-P=0.95, Max_Tokens=512)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# 3. Prepare Prompts
prompts = [
    "Explain the PagedAttention algorithm in one sentence.",
    "Write a Python function to calculate Fibonacci numbers.",
]

# 4. Execute Inference
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
    print("-" * 20)

3. Production Service Deployment

vLLM provides a high-performance HTTP server compatible with the OpenAI API.

3.1 Start API Server

bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --tensor-parallel-size 4 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 256

  • --tensor-parallel-size 4: Shard the model across 4 GPUs with Tensor Parallelism.
  • --gpu-memory-utilization 0.9: Fraction of each GPU's memory vLLM may reserve for weights and KV cache.
  • --max-num-seqs 256: Maximum number of sequences processed concurrently (upper bound on throughput).
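
Once the server is up, a quick sanity check is to list the served models through the OpenAI-compatible /v1/models endpoint (the api_key value is a placeholder when no authentication is configured):

python
# Quick sanity check against the OpenAI-compatible server started above.
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Should print "meta-llama/Meta-Llama-3-70B-Instruct" if the server started correctly.
for model in client.models.list():
    print(model.id)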

3.2 Streaming Call

Streaming output significantly improves perceived latency: the user starts reading as soon as the first token arrives (Time to First Token, TTFT) instead of waiting for the full response.

python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain Quantum Mechanics in 500 words."}],
    stream=True,
)

print("Response: ", end="", flush=True)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

4. Advanced Optimization: AWQ Quantization

For memory-constrained scenarios, Quantization is essential. Among the quantization schemes vLLM supports, AWQ (Activation-aware Weight Quantization) is one of the most mature.

4.1 Advantages

  • Memory Savings: 4-bit quantization shrinks the model weights to about 1/4 of their FP16 size; total runtime VRAM drops to roughly 1/2 to 1/3, since the KV Cache is not compressed (see the rough estimate below).
  • Speed Boost: Reduced memory-bandwidth pressure typically speeds up memory-bound (small-batch) decoding by 1.5x - 2x.
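
A rough back-of-the-envelope estimate of the weight-size claim (ignoring AWQ's group scales/zero points and activation memory):

python
# Rough weight-memory estimate for a 70B-parameter model (illustrative only).
params = 70e9

fp16_gb = params * 2 / 1e9      # 2 bytes per weight  -> ~140 GB
awq_gb  = params * 0.5 / 1e9    # 4 bits per weight   -> ~35 GB (plus scale/zero-point overhead)

print(f"FP16 weights : ~{fp16_gb:.0f} GB")
print(f"AWQ 4-bit    : ~{awq_gb:.0f} GB")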

4.2 Deploying AWQ Models

Load the AWQ version directly from HuggingFace:

bash
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-Chat-AWQ \
    --quantization awq \
    --dtype half \
    --gpu-memory-utilization 0.95

5. Benchmark

Real-world test data on NVIDIA A100 (80GB) using Llama-2-7B with 128 output tokens.

Engine       | Batch Size | Throughput (Tokens/s) | VRAM Usage (GB)
HuggingFace  | 16         | 480                   | 24.5
vLLM (FP16)  | 16         | 2100 (4.3x)           | 16.2
vLLM (FP16)  | 256        | 3850                  | 72.0

Conclusion: By minimizing memory fragmentation, vLLM fits a far larger Batch Size on the same hardware, which translates directly into a multi-fold increase in throughput.
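
These figures will vary with hardware, model, and prompt mix. A minimal sketch for measuring throughput yourself with the offline API (assuming the Llama-3-8B setup from section 2.2; the prompt and batch size are arbitrary):

python
# Minimal throughput measurement with the offline API (numbers will differ by setup).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
prompts = ["Write a short poem about GPUs."] * 256   # batch of 256 prompts

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Generated {generated} tokens in {elapsed:.1f}s "
      f"-> {generated / elapsed:.0f} tokens/s")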

AI-HPC Organization