Introduction: Why Memory Matters in Large AI Models
Welcome to the start of our journey to understand DeepSeek's innovative design.
As students of Computer Science and AI, you know that making big models fast is a key challenge. When a Large Language Model (LLM) generates text, it needs to be quick and efficient.
Our goal in this first part is to understand the big memory problem that slows down LLMs during inference (when they generate text). We will learn about the Key-Value (KV) Cache and the first technique used to shrink it: Multi-Query Attention (MQA). Understanding MQA is important because it shows us the first step researchers took to solve the high cost of memory in models like DeepSeek.
The Basics: How LLMs Predict the Next Word
When an LLM predicts the next word, it needs an input sequence of tokens. For example, if the input is "The next day is," the model must predict the next token, like "bright".
The model passes the input through its many layers, including the Multi-Head Attention (MHA) layer. To predict the new token, the model only needs the context vector of the last token ("is").
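To make this concrete, here is a minimal single-head attention pass in NumPy. It is only a sketch (the shapes, random weights, and names are illustrative, not DeepSeek's actual architecture), but it shows that the model produces one context vector per input token and that only the last one is needed to predict the next token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                       # 4 tokens: "The next day is"
X = rng.normal(size=(seq_len, d_model))       # token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_model)           # (seq_len, seq_len) attention scores
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)      # causal mask: no peeking ahead
context = softmax(scores) @ V                 # one context vector per token

last_context = context[-1]                    # only "is" matters for predicting "bright"
```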
The Good Side of Caching
When the model predicts the second word, the input sequence grows (e.g., "The next day is bright"). If we compute the output for this new, longer sequence, we end up recalculating the attention scores, Keys, and Values for the old tokens ("The next day is").
Caching fixes this problem. We store (cache) the Keys and Values (KV) from the previous step. When a new token arrives, we only compute the Keys and Values for the new token. We then append these new values to the stored KV Cache.
This process helps a lot. With the KV Cache, the work needed for each new token grows only linearly with the sequence length, because we only compute attention between the new Query and the cached Keys and Values. Without caching, we recompute attention over the entire sequence at every step, so the work grows quadratically (much, much slower). This reduction in computation lowers cost and speeds up generation.
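Below is a minimal sketch of this idea in NumPy (single head, illustrative names and shapes; real implementations are batched and multi-headed). At each step we only project the newest token and append its Key and Value to the cache, instead of recomputing them for the whole sequence.

```python
import numpy as np

def decode_step(x_new, W_Q, W_K, W_V, kv_cache):
    """One generation step using a KV Cache (single head, toy shapes)."""
    q_new = x_new @ W_Q                                       # Query for the new token
    kv_cache["K"] = np.vstack([kv_cache["K"], x_new @ W_K])   # append new Key
    kv_cache["V"] = np.vstack([kv_cache["V"], x_new @ W_V])   # append new Value
    scores = q_new @ kv_cache["K"].T / np.sqrt(x_new.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["V"], kv_cache                  # context vector for the new token

d_model = 8
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
cache = {"K": np.empty((0, d_model)), "V": np.empty((0, d_model))}

for step in range(5):                              # generate 5 tokens
    x_new = rng.normal(size=(1, d_model))          # embedding of the latest token
    context, cache = decode_step(x_new, W_Q, W_K, W_V, cache)
print(cache["K"].shape)                            # (5, 8): one cached Key per token
```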
The Bad Side: The KV Cache Memory Problem
While caching speeds up computation, it creates a serious memory problem.
You must pay a price for every piece of data you store in memory.
The size of the KV Cache depends on many factors, including:
- $L$: The number of Transformer blocks.
- $B$: The batch size.
- $N$: The number of attention heads.
- $H$: The head dimension.
- $S$: The context size (how many tokens the model remembers).
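A common back-of-the-envelope estimate (a simplification, not an exact formula for any specific model) combines these factors as:

$$\text{KV Cache size} \approx 2 \times L \times B \times N \times H \times S \times \text{bytes per element},$$

where the leading factor of 2 accounts for storing both the Keys and the Values.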
For a very large model, this memory requirement becomes huge. For example, for the DeepSeek V3 base model with standard Multi-Head Attention, the KV Cache comes to roughly 400 GB.
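As a rough sanity check of that figure, here is the estimate above with illustrative parameters in the ballpark of DeepSeek V3 under standard Multi-Head Attention (treat the exact numbers as assumptions for this sketch, not official specifications):

```python
# All parameters below are illustrative assumptions, not official DeepSeek V3 specs.
L = 61               # Transformer blocks
B = 1                # batch size
N = 128              # attention heads
H = 128              # head dimension
S = 100_000          # context size in tokens
bytes_per_elem = 2   # fp16 / bf16

kv_cache_bytes = 2 * L * B * N * H * S * bytes_per_elem   # 2 = Keys + Values
print(f"{kv_cache_bytes / 1e9:.0f} GB")                   # ~400 GB
```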
This large size overloads the memory, slows down computations, and increases the cost of running the model during inference. We must solve this memory issue.
Innovation 1: Multi-Query Attention (MQA)
Multi-Query Attention (MQA) was the first major way that researchers found to solve the KV Cache memory problem.
How MQA Works
The core idea of MQA is that all attention heads share the same Key (K) and Value (V) matrices.
- Normal Multi-Head Attention (MHA): In standard MHA, every attention head (e.g., Head 1, Head 2, Head 3, Head 4) has its own unique, trained Key weight matrix ($W_K$) and Value weight matrix ($W_V$). Because they are all different, we must store the K and V matrices for every single head in the KV Cache.
- Multi-Query Attention (MQA): MQA simplifies this. Instead of having separate K and V weights for every head, MQA uses the same $W_K$ and $W_V$ weights across all heads (see the sketch after this list).
  - This means Head 1, Head 2, Head 3, and Head 4 all use the same resulting Key matrix and the same resulting Value matrix.
  - Note: The Query (Q) vectors for each head remain different.
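Here is a toy NumPy comparison of the two projection schemes (4 heads, made-up dimensions; this is only a sketch of the idea, not a production implementation). In MHA each head gets its own Keys and Values, so the cache holds one set per head; in MQA a single shared set is cached.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))            # token embeddings

# MHA: a separate K/V projection per head -> n_heads sets of Keys/Values to cache.
W_K_mha = rng.normal(size=(n_heads, d_model, d_head))
W_V_mha = rng.normal(size=(n_heads, d_model, d_head))
K_mha = np.einsum("sd,ndh->nsh", X, W_K_mha)       # (n_heads, seq_len, d_head)
V_mha = np.einsum("sd,ndh->nsh", X, W_V_mha)

# MQA: one shared K/V projection for all heads; Queries remain per-head.
W_Q = rng.normal(size=(n_heads, d_model, d_head))  # Queries are still per-head
W_K_shared = rng.normal(size=(d_model, d_head))
W_V_shared = rng.normal(size=(d_model, d_head))
K_shared = X @ W_K_shared                          # (seq_len, d_head), cached once
V_shared = X @ W_V_shared                          # (seq_len, d_head), cached once

print(K_mha.nbytes // K_shared.nbytes)             # 4: the Key cache shrinks by n_heads
```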
The Huge Advantage: Saving Memory
Because all heads share the same K and V matrices, we only need to store the K and V for one single head in the KV Cache. We do not need to store the same data 128 times!
The KV Cache size depends heavily on the number of attention heads ($N$). MQA reduces the KV Cache size by a factor of $N$.
Consider DeepSeek with its 128 attention heads.
- Normal KV Cache size: 400 GB.
- MQA KV Cache size: $400 \text{ GB} / 128 \approx 3 \text{ GB}$.
This reduction is massive. MQA saves memory and makes inference time much faster (up to 40% faster).
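Continuing the earlier back-of-the-envelope estimate (same assumed parameters), the MQA cache is simply the standard figure divided by the number of heads:

```python
mha_cache_gb = 400                          # rough standard (MHA) estimate from earlier
n_heads = 128
print(f"{mha_cache_gb / n_heads:.2f} GB")   # ~3.1 GB
```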
The Trade-off: Performance Loss
MQA successfully saves memory, but it introduces its own problem, sometimes called its "dark side".
The original reason for using Multi-Head Attention was to capture different perspectives on the input text. For example, one head might look at grammar, and another might look at meaning.
By forcing all heads to use the same Keys and Values, MQA restricts this diversity.
- We make the heads less powerful.
- The model captures fewer different perspectives.
This means that MQA, while saving memory, leads to severe performance degradation. The model will not be as good at understanding complex text.
Because DeepSeek models are known for their strong performance, they did not use MQA directly. Researchers needed a way to save memory without losing performance.
In the next part, we will explore the innovation that came after MQA—Grouped Query Attention—which aims to find a balance between saving memory and keeping model performance high.