Build Your Own High-Performance LLM Inference Engine in C++ and CUDA
Learn to build a high-performance LLM inference engine from scratch using C++ and CUDA, covering everything from safetensors to PagedAttention.
Build Your Own High-Performance LLM Inference Engine in C++ and CUDA
Have you ever wondered what happens under the hood when you send a prompt to a large language model? The magic isn't just in the model weights—it's in the inference engine that orchestrates the computation. This post walks you through building your own high-performance LLM inference engine in C++ and CUDA, inspired by the architecture of vLLM but designed as a learning tool.
Why Build Your Own Inference Engine?
Most developers interact with LLMs through high-level APIs like Hugging Face Transformers or OpenAI. But if you want to understand the real performance characteristics—latency, throughput, memory usage—or optimize for your specific hardware, you need to go deeper. Building an inference engine from scratch teaches you:
- How LLMs actually compute predictions
- The role of CUDA kernels in accelerating matrix operations
- Memory management strategies for GPU workloads
- Advanced techniques like PagedAttention and continuous batching
The Architecture of tiny-vllm
The project, tiny-vllm, implements a full inference server for Llama 3.2 1B Instruct. It covers:
- Loading a real LLM from Safetensors format
- Full forward pass (prefill + decode)
- All computation with custom CUDA kernels
- KV cache management
- Static and continuous batching
- Online softmax and FlashAttention-like techniques
- PagedAttention
Getting Started: Technical Prerequisites
To run this on your machine, you'll need:
- Linux (tested on 6.19.8 x86_64)
- CUDA Toolkit (13.1)
- C++17 compiler (GCC 15.2.1)
- An NVIDIA GPU (tested on RTX 5090)
- The model file:
model.safetensorsfrom Llama 3.2 1B Instruct
The only external dependency is nlohmann/json (single header).
Understanding the Safetensors Format
Before writing any code, you need to understand how model weights are stored. A Safetensors file has three sections:
- Header size (8 bytes): A 64-bit unsigned integer indicating the header length
- Header (JSON): Contains metadata about each tensor—name, dtype, shape, and offsets
- Tensor data: The actual weight values
Here's how to read the header size:
std::ifstream safetensors_file("model.safetensors", std::ios_base::binary);
uint64_t header_size;
safetensors_file.read(reinterpret_cast<char *>(&header_size), 8);
The header JSON tells you where each tensor begins and ends within the data section. This allows you to map tensors directly into GPU memory without copying unnecessary data.
The Model Architecture: Llama 3.2 1B
From the Hugging Face model card, we can inspect the architecture:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
The forward pass follows this sequence:
- Tokenize input text
- Retrieve embeddings for each token
- For each of 16 transformer layers:
- RMSNorm
- Grouped-query attention (with RoPE)
- Residual connection
- RMSNorm
- Feed-forward network (with SiLU)
- Residual connection
- Final RMSNorm
- Linear projection to vocabulary size
- Argmax to select the next token
Your First CUDA Kernel: Embedding Gather
Let's implement the embedding lookup. For each input token, we need to retrieve a 2048-element vector from the embedding table.
__global__ void embeddingGatherKernel(int *gpu_input_tokens, __nv_bfloat16 *input_embeddings, __nv_bfloat16 *embed_tokens)
{
int workIndex = threadIdx.x + blockIdx.x * 2048;
input_embeddings[workIndex] = embed_tokens[gpu_input_tokens[blockIdx.x] * 2048 + threadIdx.x];
input_embeddings[workIndex + 1024] = embed_tokens[gpu_input_tokens[blockIdx.x] * 2048 + threadIdx.x + 1024];
}
Note the trick: since most GPUs support a maximum of 1024 threads per block, but each embedding has 2048 elements, we process two elements per thread. This avoids launching more blocks than necessary.
RMSNorm with Parallel Reduction
RMSNorm normalizes each embedding vector. The formula is:
$$\text{RMS}(a) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}a^2_i}$$
To compute this efficiently on GPU, we use a parallel reduction (tree reduction) in shared memory:
__shared__ float rms_vector[1024];
int workIndex = threadIdx.x + blockIdx.x * 2048;
rms_vector[threadIdx.x] = (float)input[workIndex] * (float)input[workIndex]
+ (float)input[workIndex + 1024] * (float)input[workIndex + 1024];
__syncthreads();
// Tree reduction
for (int i = 1; i < 1024; i = i * 2)
{
if (threadIdx.x % (i * 2) == 0)
{
rms_vector[threadIdx.x] = rms_vector[threadIdx.x] + rms_vector[threadIdx.x + i];
}
__syncthreads();
}
if (threadIdx.x == 0)
{
rms_vector[0] = sqrt(rms_vector[0] / 2048.0 + 1.0e-5);
}
__syncthreads();
output[workIndex] = (__nv_bfloat16)(((float)input[workIndex] / rms_vector[0]) * (float)norm_weights[threadIdx.x]);
output[workIndex + 1024] = (__nv_bfloat16)(((float)input[workIndex + 1024] / rms_vector[0]) * (float)norm_weights[threadIdx.x + 1024]);
The cuBLAS Transposition Trick
cuBLAS expects matrices in column-major format, but models are typically stored in row-major. Instead of converting the data, we can use matrix transposition properties:
$$C = A \times B^T \implies C^T = B \times A^T$$
Since cuBLAS reads row-major data as if it were column-major (effectively transposing it), we set the flags accordingly:
cublasGemmEx(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
KV_DIM, num_active_slots, EMBEDDING_LENGTH,
&k_proj_alpha, weights.w_k[layer], CUDA_R_16BF, EMBEDDING_LENGTH,
rms_norms, CUDA_R_16BF, EMBEDDING_LENGTH,
&k_proj_beta, k_proj_batched_buffer, CUDA_R_16BF, KV_DIM,
CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
Why KV Cache Exists
During inference, only the K and V projections from each token need to persist. The Q projection, attention scores, and feed-forward outputs are discarded after each step. By caching K and V, we avoid recomputing them for every new token—a massive performance win.
Prefill vs. Decode
- Prefill: Process all input tokens to generate the first output token. This is compute-bound because you're processing many tokens at once.
- Decode: Generate subsequent tokens one at a time. This is memory-bound because you're fetching cached K and V values.
Grouped-Query Attention (GQA)
In GQA, multiple query heads share the same key and value head. For Llama 3.2 1B, 4 query heads share 1 key/value head. This reduces memory usage and improves throughput without significant quality loss.
for (int i = 0; i < NUM_Q_HEADS; ++i)
{
int k_head_idx = i / GQA_Q_TO_K_RATIO; // 4 Q heads use the same 1 K head
__nv_bfloat16 *q_head = q_proj + i * HEAD_DIM;
__nv_bfloat16 *k_head = k_proj[layer] + k_head_idx * HEAD_DIM;
// ... compute attention scores
}
Continuous Batching
Static batching processes N requests together but waits for the slowest one. Continuous batching solves this by using slots: when a request finishes, its slot is immediately reused for a new request from the queue. This maximizes GPU utilization.
What's Next?
The repository is actively being developed. Upcoming sections include:
- Online softmax derivation and implementation
- PagedAttention for efficient memory management
- Paged KV cache
Conclusion
Building an LLM inference engine from scratch is one of the best ways to deeply understand how modern AI systems work. You'll learn about GPU programming, memory management, linear algebra, and system design—all while building something that actually runs a real model.
If you're a lecturer, feel free to use this as a teaching resource. If you're a learner, fork the repo, make it work on your machine, and contribute your improvements back.
Happy coding!