Practical Guide: Getting Started with OpenAI Triton

Abstract: In the realm of deep learning operator development, OpenAI Triton is emerging as a strong competitor to CUDA. This article introduces the core design philosophy of Triton and demonstrates how to use Python to write GPU kernels that are not only readable but also comparable in performance to hand-written CUDA.

1. Why Triton?

In the AI infrastructure layer, compute efficiency is a perennial concern. Traditional CUDA C++ is powerful, but it comes with significant pain points:

  1. Steep Learning Curve: Requires mastery of C++ and GPU architecture details.
  2. Complex Memory Management: Manual handling of memory coalescing and shared memory synchronization is required.

OpenAI Triton is a language and compiler designed to address these problems. PyTorch 2.0 adopted Triton as the default GPU code-generation backend for torch.compile (via TorchInductor), effectively making it a de facto standard in the AI compilation stack.

Core Concept: Block-based Programming

Unlike CUDA's SIMT (Single Instruction, Multiple Threads) model, Triton adopts a Block-based Programming paradigm.

  • CUDA: Developers control threadIdx.x (Scalar).
  • Triton: Developers control tl.arange(0, BLOCK_SIZE) (Tensor Block).

This abstraction lets the compiler handle memory coalescing, shared-memory usage, and scheduling automatically, significantly lowering the barrier to entry.
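
To make the contrast concrete, here is a host-side analogy in plain PyTorch (not Triton code itself; the block size and program ID are arbitrary illustrative values):

python
import torch

BLOCK_SIZE = 8
pid = 2  # in CUDA terms: blockIdx.x; in Triton terms: tl.program_id(axis=0)

# CUDA/SIMT view: each of the BLOCK_SIZE threads holds ONE scalar index
scalar_indices = [pid * BLOCK_SIZE + tid for tid in range(BLOCK_SIZE)]

# Triton view: one program holds the WHOLE block of indices as a single tensor,
# analogous to pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE) inside a kernel
block_indices = pid * BLOCK_SIZE + torch.arange(BLOCK_SIZE)

print(scalar_indices)          # [16, 17, ..., 23]
print(block_indices.tolist())  # same indices, expressed as one tensor operation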


2. Installation

Triton currently works best on Linux with NVIDIA GPUs (AMD support is improving rapidly).

  • OS: Linux (Ubuntu 20.04/22.04)
  • Python: 3.8+
  • GPU: NVIDIA Tesla T4 / A10 / A100 / H100
  • CUDA: 11.8+

Installation Command

Usually, installing PyTorch 2.0+ automatically includes Triton. If you need to install it independently or update to the latest version:

bash
pip install triton
# Verify installation
python -c "import triton; print(f'Triton version: {triton.__version__}')"
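
As a quick environment sanity check (a minimal sketch, assuming PyTorch is installed alongside Triton), confirm that a CUDA device is actually visible before moving on:

python
import torch
import triton

print(f"Triton version: {triton.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")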

3. Core Practice: Vector Addition

We will implement a high-performance Vector Addition operator, $Z = X + Y$. Although simple, it covers the complete workflow of Triton programming.

3.1 Import Dependencies

python
import torch
import triton
import triton.language as tl

3.2 Write the Kernel

Triton kernels are defined with the @triton.jit decorator and compiled just-in-time on first launch. Pay attention to the mask handling in the code below; it is crucial for data sizes that are not a multiple of the block size.

python
@triton.jit
def add_kernel(
    x_ptr,  # Pointer to X vector
    y_ptr,  # Pointer to Y vector
    output_ptr, # Pointer to Output vector
    n_elements, # Total number of elements
    BLOCK_SIZE: tl.constexpr, # Block size (Compile-time constant)
):
    # 1. Get the current Program ID (Similar to CUDA blockIdx)
    pid = tl.program_id(axis=0)

    # 2. Calculate offsets for the current block
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)

    # 3. Create a mask to prevent out-of-bounds access
    # E.g., if n_elements=100, BLOCK_SIZE=32, the last block needs masking
    mask = offsets < n_elements

    # 4. Load data
    # Where mask is False, no out-of-bounds memory is read; those lanes are
    # never written back either (the store below uses the same mask), so this is safe
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)

    # 5. Compute
    output = x + y

    # 6. Store result
    tl.store(output_ptr + offsets, output, mask=mask)

3.3 Host Code

We need to define the Grid size and launch the Kernel.

python
def add(x: torch.Tensor, y: torch.Tensor):
    # Pre-allocate output
    output = torch.empty_like(x)
    n_elements = output.numel()
    
    # Define Grid: A tuple representing how many Programs to launch
    # triton.cdiv is ceiling division
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    
    # Launch Kernel
    # BLOCK_SIZE is fixed here for simplicity; it could also be chosen via triton.autotune
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    
    return output
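
Before benchmarking, it is worth checking that the Triton kernel matches PyTorch's result. A minimal sanity check (the size below is arbitrary and deliberately not a multiple of BLOCK_SIZE, so the mask path is exercised):

python
torch.manual_seed(0)
size = 98432  # not a multiple of 1024, so the last block relies on the mask
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')

out_triton = add(x, y)
out_torch = x + y

print(f"Max abs difference: {(out_triton - out_torch).abs().max().item()}")
assert torch.allclose(out_triton, out_torch)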

4. Performance Analysis & Benchmarking

To measure Triton's performance, we use triton.testing.do_bench, wrapped in triton.testing.perf_report so we can sweep over vector sizes and compare against native PyTorch.

python
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],  # Argument name for x-axis
        x_vals=[2**i for i in range(12, 28, 1)],  # Different vector sizes
        x_log=True,  # Log scale for x-axis
        line_arg='provider',  # Argument name to define lines
        line_vals=['triton', 'torch'],  # Values for the line argument
        line_names=['Triton', 'PyTorch'],  # Legend names
        styles=[('blue', '-'), ('green', '-')],  # Line styles
        ylabel='GB/s',  # y-axis label
        plot_name='vector-add-performance',  # Plot name
        args={},  # Other arguments passed to the benchmark function
    )
)
def benchmark(size, provider):
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
    quantiles = [0.5, 0.2, 0.8]
    
    if provider == 'torch':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: x + y, quantiles=quantiles)
    elif provider == 'triton':
        ms, min_ms, max_ms = triton.testing.do_bench(lambda: add(x, y), quantiles=quantiles)
        
    # Calculate effective bandwidth (GB/s)
    gbps = lambda ms: 3 * x.numel() * x.element_size() / ms * 1e-6
    return gbps(ms), gbps(max_ms), gbps(min_ms)

# Run benchmark
if __name__ == '__main__':
    benchmark.run(print_data=True, show_plots=False)

Theoretical Bandwidth Limit

For element-wise operations, the bottleneck is usually memory bandwidth (the kernel is memory-bound) rather than compute. Because Triton automatically optimizes memory access, it can come close to the GPU's theoretical bandwidth limit:

$$BW_{eff} = \frac{N_{bytes}}{T_{kernel}}$$

On an A100, a well-written Triton kernel can reach an effective bandwidth of over 1500 GB/s, close to the card's theoretical peak (roughly 1.5–2 TB/s depending on the variant).
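
As a back-of-the-envelope check (the kernel time below is an illustrative assumption, not a measurement): vector addition of N float32 elements moves 3 * N * 4 bytes (two reads plus one write), so the effective bandwidth follows directly from the kernel time:

python
# Illustrative numbers only: N and t_kernel_ms are assumptions, not measurements
N = 2**26                      # float32 elements per vector
bytes_moved = 3 * N * 4        # two loads + one store, 4 bytes per element
t_kernel_ms = 0.55             # hypothetical kernel time in milliseconds

bw_eff = bytes_moved / (t_kernel_ms * 1e-3) / 1e9  # GB/s
print(f"Effective bandwidth: {bw_eff:.0f} GB/s")   # ~1464 GB/s with these numbers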

5. Summary

Triton bridges the gap between Python's ease of use and the raw performance of GPUs. By mastering Triton, you can write custom, efficient operators for LLM inference (such as FlashAttention) or scientific computing without having to drop down to PTX assembly.

AI-HPC Organization