Build Your Own High-Performance LLM Inference Engine in C++ and CUDA

Have you ever wondered what happens under the hood when you send a prompt to a large language model? The magic isn't just in the model weights—it's in the inference engine that orchestrates the computation. This post walks you through building your own high-performance LLM inference engine in C++ and CUDA, inspired by the architecture of vLLM but designed as a learning tool.

Why Build Your Own Inference Engine?

Most developers interact with LLMs through high-level APIs like Hugging Face Transformers or OpenAI. But if you want to understand the real performance characteristics—latency, throughput, memory usage—or optimize for your specific hardware, you need to go deeper. Building an inference engine from scratch teaches you:

How LLMs actually compute predictions
The role of CUDA kernels in accelerating matrix operations
Memory management strategies for GPU workloads
Advanced techniques like PagedAttention and continuous batching

The Architecture of tiny-vllm

The project, tiny-vllm, implements a full inference server for Llama 3.2 1B Instruct. It covers:

Loading a real LLM from Safetensors format
Full forward pass (prefill + decode)
All computation with custom CUDA kernels
KV cache management
Static and continuous batching
Online softmax and FlashAttention-like techniques
PagedAttention

Getting Started: Technical Prerequisites

To run this on your machine, you'll need:

Linux (tested on 6.19.8 x86_64)
CUDA Toolkit (13.1)
C++17 compiler (GCC 15.2.1)
An NVIDIA GPU (tested on RTX 5090)
The model file: model.safetensors from Llama 3.2 1B Instruct

The only external dependency is nlohmann/json (single header).

Understanding the Safetensors Format

Before writing any code, you need to understand how model weights are stored. A Safetensors file has three sections:

Header size (8 bytes): A 64-bit unsigned integer indicating the header length
Header (JSON): Contains metadata about each tensor—name, dtype, shape, and offsets
Tensor data: The actual weight values

Here's how to read the header size:

std::ifstream safetensors_file("model.safetensors", std::ios_base::binary);
uint64_t header_size;
safetensors_file.read(reinterpret_cast<char *>(&header_size), 8);

The header JSON tells you where each tensor begins and ends within the data section. This allows you to map tensors directly into GPU memory without copying unnecessary data.

The Model Architecture: Llama 3.2 1B

From the Hugging Face model card, we can inspect the architecture:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)

The forward pass follows this sequence:

Tokenize input text
Retrieve embeddings for each token
For each of 16 transformer layers:
- RMSNorm
- Grouped-query attention (with RoPE)
- Residual connection
- RMSNorm
- Feed-forward network (with SiLU)
- Residual connection
Final RMSNorm
Linear projection to vocabulary size
Argmax to select the next token

Your First CUDA Kernel: Embedding Gather

Let's implement the embedding lookup. For each input token, we need to retrieve a 2048-element vector from the embedding table.

__global__ void embeddingGatherKernel(int *gpu_input_tokens, __nv_bfloat16 *input_embeddings, __nv_bfloat16 *embed_tokens)
{
    int workIndex = threadIdx.x + blockIdx.x * 2048;
    input_embeddings[workIndex] = embed_tokens[gpu_input_tokens[blockIdx.x] * 2048 + threadIdx.x];
    input_embeddings[workIndex + 1024] = embed_tokens[gpu_input_tokens[blockIdx.x] * 2048 + threadIdx.x + 1024];
}

Note the trick: since most GPUs support a maximum of 1024 threads per block, but each embedding has 2048 elements, we process two elements per thread. This avoids launching more blocks than necessary.

RMSNorm with Parallel Reduction

RMSNorm normalizes each embedding vector. The formula is:

$$\text{RMS}(a) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}a^2_i}$$

To compute this efficiently on GPU, we use a parallel reduction (tree reduction) in shared memory:

__shared__ float rms_vector[1024];
int workIndex = threadIdx.x + blockIdx.x * 2048;
rms_vector[threadIdx.x] = (float)input[workIndex] * (float)input[workIndex] 
                        + (float)input[workIndex + 1024] * (float)input[workIndex + 1024];
__syncthreads();

// Tree reduction
for (int i = 1; i < 1024; i = i * 2)
{
    if (threadIdx.x % (i * 2) == 0)
    {
        rms_vector[threadIdx.x] = rms_vector[threadIdx.x] + rms_vector[threadIdx.x + i];
    }
    __syncthreads();
}

if (threadIdx.x == 0)
{
    rms_vector[0] = sqrt(rms_vector[0] / 2048.0 + 1.0e-5);
}
__syncthreads();

output[workIndex] = (__nv_bfloat16)(((float)input[workIndex] / rms_vector[0]) * (float)norm_weights[threadIdx.x]);
output[workIndex + 1024] = (__nv_bfloat16)(((float)input[workIndex + 1024] / rms_vector[0]) * (float)norm_weights[threadIdx.x + 1024]);

The cuBLAS Transposition Trick

cuBLAS expects matrices in column-major format, but models are typically stored in row-major. Instead of converting the data, we can use matrix transposition properties:

$$C = A \times B^T \implies C^T = B \times A^T$$

Since cuBLAS reads row-major data as if it were column-major (effectively transposing it), we set the flags accordingly:

cublasGemmEx(cublas_handle, 
             CUBLAS_OP_T, CUBLAS_OP_N, 
             KV_DIM, num_active_slots, EMBEDDING_LENGTH, 
             &k_proj_alpha, weights.w_k[layer], CUDA_R_16BF, EMBEDDING_LENGTH, 
             rms_norms, CUDA_R_16BF, EMBEDDING_LENGTH, 
             &k_proj_beta, k_proj_batched_buffer, CUDA_R_16BF, KV_DIM, 
             CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

Why KV Cache Exists

During inference, only the K and V projections from each token need to persist. The Q projection, attention scores, and feed-forward outputs are discarded after each step. By caching K and V, we avoid recomputing them for every new token—a massive performance win.

Prefill vs. Decode

Prefill: Process all input tokens to generate the first output token. This is compute-bound because you're processing many tokens at once.
Decode: Generate subsequent tokens one at a time. This is memory-bound because you're fetching cached K and V values.

Grouped-Query Attention (GQA)

In GQA, multiple query heads share the same key and value head. For Llama 3.2 1B, 4 query heads share 1 key/value head. This reduces memory usage and improves throughput without significant quality loss.

for (int i = 0; i < NUM_Q_HEADS; ++i)
{
    int k_head_idx = i / GQA_Q_TO_K_RATIO; // 4 Q heads use the same 1 K head
    __nv_bfloat16 *q_head = q_proj + i * HEAD_DIM;
    __nv_bfloat16 *k_head = k_proj[layer] + k_head_idx * HEAD_DIM;
    // ... compute attention scores
}

Continuous Batching

Static batching processes N requests together but waits for the slowest one. Continuous batching solves this by using slots: when a request finishes, its slot is immediately reused for a new request from the queue. This maximizes GPU utilization.

What's Next?

The repository is actively being developed. Upcoming sections include:

Online softmax derivation and implementation
PagedAttention for efficient memory management
Paged KV cache

Conclusion

Building an LLM inference engine from scratch is one of the best ways to deeply understand how modern AI systems work. You'll learn about GPU programming, memory management, linear algebra, and system design—all while building something that actually runs a real model.

If you're a lecturer, feel free to use this as a teaching resource. If you're a learner, fork the repo, make it work on your machine, and contribute your improvements back.

Happy coding!

Source

jmaczan/tiny-vllm: Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM