Build DeepSeek from scratch - Part 9: Multi-Head Latent Attention (MHLA)

December 16, 2025 (3mo ago)

Introduction: DeepSeek's Secret to Efficiency

Welcome, future AI engineers! In this part of our series, we focus on a major idea that made DeepSeek V2 models very fast and efficient: Multi-Head Latent Attention (MHLA).

When you study AI, especially Large Language Models (LLMs), you learn about the Transformer architecture. The attention mechanism is the core of this architecture, but it can be slow and expensive.

Our goal today is to understand exactly how DeepSeek changed this attention mechanism to save a lot of memory and money during inference (when the model is generating text). This knowledge is key for anyone learning AI and systems engineering because efficiency is crucial for building large-scale models.


1. The Problem of Repeated Work During Inference

When an LLM generates text, it predicts one word (token) after the other.

Imagine you ask the model: "Make a travel plan for Spain."

  1. The model takes your text and predicts the next token, for example: "The".
  2. Now the input is "Make a travel plan for Spain. The". The model feeds all of these tokens back in and predicts the next one: "next".
  3. The model repeats this process, one token at a time, until the answer is complete.

The main issue is that, to predict each new token, the model recomputes everything for the old tokens at every step, even though those tokens have not changed. This repetition wastes a lot of computation time.
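To make the wasted work concrete, here is a minimal sketch of naive generation in NumPy (single attention head, toy dimensions; all names and shapes are illustrative, not DeepSeek's actual code). At every step the model recomputes Q, K, and V for every token seen so far, even though the old tokens have not changed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 64
W_Q = np.random.randn(d_model, d_model) * 0.02
W_K = np.random.randn(d_model, d_model) * 0.02
W_V = np.random.randn(d_model, d_model) * 0.02

X = np.random.randn(1, d_model)           # embeddings of the tokens so far
for step in range(5):                     # generate 5 new tokens
    # Naive inference: recompute Q, K, V for ALL tokens at every step
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    scores = softmax(Q[-1:] @ K.T / np.sqrt(d_model))   # only the last row is needed
    context = scores @ V                  # context vector for the newest token
    new_token_embedding = np.random.randn(1, d_model)   # stand-in for the real next token
    X = np.vstack([X, new_token_embedding])
```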

2. Solution 1: The Key-Value (KV) Cache

To fix repeated calculations, engineers invented the Key-Value (KV) Cache.

In the Transformer's attention block, every input token creates three important vectors: Query (Q), Key (K), and Value (V). To predict the next word, we only need the context vector for the very last token.

The realization was this: We don't need to re-calculate the Key (K) and Value (V) vectors for the old tokens. We can simply store (cache) the K and V matrices from previous steps.

When a new token arrives, we only calculate its new Q, K, and V vectors. We then combine the new K and V vectors with the stored (cached) old K and V vectors to get the full K and V matrices needed for the attention calculation.
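Here is the same toy loop rewritten with a KV cache (again a minimal single-head NumPy sketch, not DeepSeek's actual code): the old K and V rows are kept around, and each step only computes the projections for the newest token.

```python
import numpy as np

d_model = 64
W_Q = np.random.randn(d_model, d_model) * 0.02
W_K = np.random.randn(d_model, d_model) * 0.02
W_V = np.random.randn(d_model, d_model) * 0.02

K_cache = np.zeros((0, d_model))          # cached Key rows, one per past token
V_cache = np.zeros((0, d_model))          # cached Value rows

def attend(x_new, K_cache, V_cache):
    """Process one new token: compute only ITS Q/K/V and reuse the cache."""
    q_new = x_new @ W_Q
    k_new = x_new @ W_K
    v_new = x_new @ W_V
    K_cache = np.vstack([K_cache, k_new])  # append instead of recomputing
    V_cache = np.vstack([V_cache, v_new])
    scores = q_new @ K_cache.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ V_cache            # context vector for the newest token
    return context, K_cache, V_cache

for step in range(5):
    x_new = np.random.randn(1, d_model)    # embedding of the newest token
    context, K_cache, V_cache = attend(x_new, K_cache, V_cache)
```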

Result: The KV Cache speeds up inference significantly. Each generation step now only does the work for the newest token instead of repeating the computation for the whole sequence, so the per-step cost grows linearly with the sequence length instead of quadratically.

3. The Big Challenge: Memory Overload

The KV Cache solved the speed problem, but it created a new, large problem: memory usage.

We must store the K and V matrices for every token in the input sequence, and we must do this for every layer in the model.

The size of the KV cache depends on:

  • the number of layers in the model (K and V are cached for every layer),
  • the number of attention heads ($N$) times the head dimension ($H$),
  • the length of the input sequence (one K row and one V row per token), and
  • the numeric precision (bytes per stored value), with an extra factor of 2 because both K and V are stored.

For a large model like DeepSeek V2 with long sequences (a context window of 100,000 tokens), a standard KV cache can grow to roughly 400 GB. Storing this amount of data is expensive and slows down other computations.
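As a quick sanity check on that figure, here is the back-of-the-envelope arithmetic in Python. The hyperparameters used (60 layers, 128 heads of dimension 128, fp16) are assumptions chosen for illustration, but they land close to the 400 GB figure.

```python
# Rough KV-cache sizing, assuming illustrative DeepSeek-V2-like hyperparameters.
layers   = 60        # transformer layers (assumed)
heads    = 128       # attention heads per layer (assumed)
head_dim = 128       # dimension per head (assumed)
seq_len  = 100_000   # context window from the example above
bytes_per_value = 2  # fp16

# Factor of 2 because we store both a K row and a V row per token, head, and layer.
kv_cache_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_value
print(f"{kv_cache_bytes / 1e9:.0f} GB")  # ~393 GB
```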

4. Past Attempts to Reduce Memory (MQA and GQA)

To reduce the memory size, people looked at the $N \times H$ factor (number of attention heads $\times$ head dimension).

The standard Multi-Head Attention (MHA) uses different K and V values for every head, which is good for performance. This means we must cache $N$ different sets of K and V values, one for each head.

To reduce memory, people tried to make heads share content (the shape sketch after this list makes the difference concrete):

  1. Multi-Query Attention (MQA): All attention heads share the exact same Key and Value matrices ($K_1 = K_2 = K_3$, and $V_1 = V_2 = V_3$). This shrinks the cache by a factor equal to the number of heads (over 100 times for models with 128 heads).
  2. Grouped Query Attention (GQA): Heads are put into small groups, and only heads within the same group share K and V. This saves less memory than MQA, but still far more than standard MHA (which shares nothing).
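The difference between the three schemes is easiest to see in how many values must be cached per token. A rough sketch with hypothetical numbers (128 heads of dimension 128, grouped 8 per group):

```python
n_heads, head_dim, n_groups = 128, 128, 16   # 16 groups of 8 heads (illustrative)

# MHA: every head has its own K and V row per token
mha_per_token = 2 * n_heads * head_dim        # 32,768 values per token

# MQA: one K and one V row shared by all heads
mqa_per_token = 2 * head_dim                  # 256 values per token (128x smaller)

# GQA: one K and one V row per group of heads
gqa_per_token = 2 * n_groups * head_dim       # 4,096 values per token (8x smaller)

print(mha_per_token, mqa_per_token, gqa_per_token)
```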

The Performance Trade-off

While MQA and GQA save memory, they hurt the model's performance.

Why? Because sharing the Keys and Values means the attention heads cannot capture enough different information (diversity) from the text.

5. DeepSeek's Innovation: Multi-Head Latent Attention (MHLA)

The goal of DeepSeek was to achieve the best of both worlds: low cache size like MQA AND good performance like MHA.

DeepSeek achieved this by introducing two key technical ideas:

  1. Using a Latent Space to compress information.
  2. Using an Absorption Trick to simplify calculations during inference.

Step 1: Caching Only One Small Matrix

In the traditional KV cache, we store two large matrices: K and V.

DeepSeek asked: What if we only cache one matrix? And what if this one matrix has much smaller dimensions than the full K and V matrices ($N \times H$)?

They achieved this by taking the input embedding matrix ($X$) and multiplying it with a new down-projection weight matrix ($W_{DKV}$), creating a new, smaller matrix called the Latent KV Matrix ($C_{KV}$).

$$C_{KV} = X \times W_{DKV}$$

We only cache this single $C_{KV}$ matrix. This immediately removes the factor of 2 (for caching K and V separately) and lets us choose a much smaller dimension per token (e.g., 576) instead of the huge $N \times H$ dimension ($128 \times 128 = 16{,}384$).
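A minimal sketch of Step 1 in NumPy, with assumed dimensions (a model dimension of 5120 and a latent dimension of 576 are used purely for illustration): the embeddings are projected down once, and only the small latent matrix is cached.

```python
import numpy as np

d_model  = 5120    # embedding dimension (assumed, for illustration)
d_latent = 576     # latent KV dimension, much smaller than N*H = 128*128 = 16,384
seq_len  = 1_000

W_DKV = np.random.randn(d_model, d_latent) * 0.02   # down-projection weights
X     = np.random.randn(seq_len, d_model)           # token embeddings

C_KV = X @ W_DKV   # the ONLY thing we cache: shape (seq_len, 576)

# Compare against a standard cache that stores K and V for 128 heads of dim 128
standard_values_per_token = 2 * 128 * 128
latent_values_per_token   = d_latent
print(standard_values_per_token / latent_values_per_token)   # ~57x smaller
```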

Step 2: The Absorption Trick

How can we calculate the Keys and Values needed for attention if we only cache $C_{KV}$?

  1. Calculating Keys and Values: The Keys ($K$) and Values ($V$) are now created by multiplying the cached $C_{KV}$ matrix with new trainable weight matrices ($W_{UK}$ and $W_{UV}$), which are unique for each attention head. This ensures that every head still has different K and V content, which is essential for good performance.

  2. Calculating Attention Scores: Attention scores require multiplying the Query ($Q$) by the Key Transpose ($K^T$). In MHLA, Keys are defined as $K = X \times W_{DKV} \times W_{UK}$.

    The Absorption Trick works because $W_{DKV}$ and $W_{UK}$ are fixed matrices (weights learned during pre-training) and matrix multiplication is associative. DeepSeek therefore absorbs the fixed up-projection part of the Key calculation ($W_{UK}$) into the Query calculation ($Q$), so the full Key matrix never has to be rebuilt during inference.

    When a new token arrives:

    • We compute the Absorbed Query by multiplying the new token's embedding ($X_{new}$) with the fixed combined weights ($W_Q \times W_{UK}^T$).
    • We update the $C_{KV}$ cache with the new token's latent vector.
    • We multiply the Absorbed Query by the transpose of the updated $C_{KV}$ cache to get the attention scores directly, without ever materializing K.

This is a powerful mathematical rearrangement. By absorbing some weights into the query vector, we only need to multiply the result with the small, single $C_{KV}$ cache.
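The rearrangement is just the associativity of matrix multiplication: $(X_{new} W_Q)(C_{KV} W_{UK})^T = X_{new} (W_Q W_{UK}^T)\, C_{KV}^T$. A small numerical check of this identity (single head, toy sizes, all names illustrative):

```python
import numpy as np

d_model, d_latent, d_head, seq_len = 64, 16, 8, 10

W_Q   = np.random.randn(d_model, d_head)
W_DKV = np.random.randn(d_model, d_latent)
W_UK  = np.random.randn(d_latent, d_head)

X     = np.random.randn(seq_len, d_model)   # all tokens so far
x_new = np.random.randn(1, d_model)         # the newest token

C_KV = X @ W_DKV                             # what we actually cache

# Naive path: materialize the Keys, then compute the scores
K = C_KV @ W_UK
scores_naive = (x_new @ W_Q) @ K.T

# Absorbed path: fold W_UK into the query, never materialize K
q_absorbed = x_new @ (W_Q @ W_UK.T)          # shape (1, d_latent)
scores_absorbed = q_absorbed @ C_KV.T        # multiply directly with the cache

print(np.allclose(scores_naive, scores_absorbed))   # True
```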

6. The Result: Best of Both Worlds

DeepSeek’s Multi-Head Latent Attention solved both critical problems:

  1. Low Memory Cost: The size of the KV cache is reduced dramatically. For DeepSeek, the reduction factor is about 57 times (the arithmetic is sketched just after this list), shrinking the cache from roughly 400 GB to around 7 GB.
  2. Good Performance: Because the Keys and Values are reconstructed using separate, unique weights ($W_{UK}$, $W_{UV}$) for each attention head, the heads do not share content. This maintains the high performance of the model, unlike MQA.
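Using the numbers from earlier (128 heads of dimension 128 versus a 576-dimensional latent vector cached per token, writing $d_c$ for that latent dimension), the reduction factor works out to roughly:

$$\frac{2 \times N \times H}{d_c} = \frac{2 \times 128 \times 128}{576} \approx 57$$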

This simple, smart innovation—projecting into a latent space and using mathematical absorption—allows DeepSeek to run massive models with much lower memory needs during inference.