Introduction: Solving the Hardest Problem in AI Systems
Welcome, students! In this part of building DeepSeek, we look at a very clever piece of engineering. We want our AI model to be fast and to use little memory at inference time; that is what Multi-Head Latent Attention (MLA) gives us. We also want it to understand where each word sits in the sequence; that is what Rotary Positional Encoding (RoPE) gives us.
The challenge is that MLA and RoPE do not work well together. If you combine them naively, the model becomes slow.
The goal of this post is to understand how DeepSeek V2 and V3 solved this conflict by integrating MLA with RoPE. Understanding this "Decoupled RoPE" trick is essential for advanced AI and systems engineering, as it teaches us how to get the best of both worlds.
1. The Core Idea: Why Latent Attention Works
First, let us quickly remember Multi-Head Latent Attention (MLA). MLA is designed to save memory during inference (when the model is running).
The Power of the Absorption Trick
In standard attention, we need to compute and cache (save) large key ($K$) and value ($V$) matrices for every token.
MLA uses a clever method called the absorption trick:
- The input embedding is multiplied by a matrix ($W_{DKV}$) to project it into a smaller, latent dimension. This creates the latent Key-Value matrix, $C_{KV}$.
- For keys, the query and key weight matrices ($W_Q$ and $W_{UK}^T$) sit side by side in the attention product, so they can be mathematically combined, or "absorbed," into one single matrix.
- Because these weights are absorbed, we only need to cache the small $C_{KV}$ matrix. We do not need to recompute the full keys for all tokens, which saves a lot of time and memory.
The most important rule for the absorption trick is that the weight matrices must sit next to each other in the matrix product, with nothing position-dependent in between, so they can be pre-multiplied into one fixed matrix. The sketch below makes this concrete.
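Here is a minimal NumPy sketch of the idea (toy dimensions, a single head; the names $W_Q$, $W_{DKV}$, and $W_{UK}$ simply mirror the text above and are not DeepSeek's actual code). It checks that the absorbed product gives exactly the same scores as rebuilding the full keys from the latent cache.

```python
import numpy as np

# Minimal absorption-trick sketch: toy dimensions, one head, illustrative names.
d_model, d_latent = 16, 4
rng = np.random.default_rng(0)

W_Q   = rng.standard_normal((d_model, d_model))   # query projection
W_DKV = rng.standard_normal((d_model, d_latent))  # down-projection into the latent space
W_UK  = rng.standard_normal((d_latent, d_model))  # up-projection from latent back to keys

x_q  = rng.standard_normal((1, d_model))   # the current query token
x_k  = rng.standard_normal((5, d_model))   # five earlier tokens
C_KV = x_k @ W_DKV                         # the small latent matrix we actually cache

# Naive route: rebuild the full keys from the cache, then score.
scores_naive = (x_q @ W_Q) @ (C_KV @ W_UK).T

# Absorbed route: W_Q and W_UK^T are adjacent in that product, so they collapse
# into one fixed matrix and the full keys are never materialized.
W_absorbed = W_Q @ W_UK.T                  # precomputed once, before inference
scores_absorbed = (x_q @ W_absorbed) @ C_KV.T

assert np.allclose(scores_naive, scores_absorbed)   # identical scores, less work per step
```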
2. The Conflict: RoPE Breaks the Absorption Trick
Rotary Positional Encoding (RoPE) is a technique that gives the model important information about the position of words in the sequence. RoPE works by rotating parts of the query and key vectors based on their position.
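To get a feel for what this rotation looks like, here is a minimal sketch of the standard "rotate-half" RoPE formulation applied to a single vector (toy dimensions; this shows the generic technique, not DeepSeek-specific code).

```python
import numpy as np

# Minimal RoPE sketch: rotate pairs of dimensions by an angle that grows with position.
def rope(vec: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Standard rotate-half RoPE applied to the last dimension."""
    half = vec.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = vec[..., :half], vec[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.ones(8)
print(rope(q, pos=0))   # position 0: all angles are zero, the vector is unchanged
print(rope(q, pos=7))   # a later position: each pair is rotated by a different angle
```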
The Problem with Combining Them
- RoPE is Position Dependent: The rotation applied by RoPE depends on the position of the token.
- RoPE Sits in the Middle: If you apply RoPE to the queries and keys in MLA, the position-dependent rotation lands exactly between the weight matrices ($W_Q$ and $W_{UK}^T$) that we want to absorb.
- Absorption Fails: Because that rotation changes with every position, there is no longer one fixed matrix into which the weights can be absorbed.
If absorption fails, we lose the main advantage of MLA: we must recompute the full keys for all previous tokens at every inference step. This significantly slows down the model and makes MLA pointless. In short, naive RoPE is incompatible with MLA's key-value compression.
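Written out with the notation from above, and writing $R_p$ for the RoPE rotation at position $p$ (acting on row vectors), the problem is visible in one line. This is a compact sketch of the argument. Without RoPE, the score between query position $m$ and key position $n$ is

$$
q_m k_n^\top = (x_m W_Q)\,(c_n W_{UK})^\top = x_m \big(W_Q W_{UK}^\top\big)\, c_n^\top,
$$

and the bracketed product is one fixed matrix that can be precomputed. With RoPE applied to both the query and the key, it becomes

$$
q_m k_n^\top = (x_m W_Q R_m)\,(c_n W_{UK} R_n)^\top = x_m \big(W_Q R_{m-n} W_{UK}^\top\big)\, c_n^\top,
$$

using the rotation identity $R_m R_n^\top = R_{m-n}$. The bracketed matrix now depends on the relative position $m - n$, so there is no single fixed matrix to absorb.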
3. The DeepSeek Solution: Decoupled RoPE
DeepSeek found a simple and smart way to solve this: Decoupled RoPE.
"Decoupled" means they split the attention calculation into two separate paths:
- Path 1: Content Attention ($Q_C, K_C$) - This path handles the core semantic information without applying RoPE.
- Path 2: Positional Attention ($Q_R, K_R$) - This path computes only the positional relationship by applying RoPE.
The final attention score is simply the sum of the two paths: $Q_C K_C^T + Q_R K_R^T$ (before the usual scaling and softmax).
Path 1: Retaining the Magic (No RoPE)
In Path 1, RoPE is not applied. This means the original absorption trick still works for $Q_C$ and $K_C$.
- Key and Value ($K_C$, $V_C$): These are computed from the input embedding projected into the small latent matrix $C_{KV}$. This $C_{KV}$ is the latent cache.
- Query ($Q_C$): DeepSeek makes a small change here. They take the input, first down-project it to a lower dimension ($C_Q$), and then up-project it back to the full dimension to get $Q_C$. This extra step saves activation memory during training.
- Inference: When a new token arrives, $Q_C$ is calculated with the absorbed weights and multiplied by the cached $C_{KV}$ matrix, which keeps latency low; see the sketch after this list.
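Here is a minimal sketch of this content path at inference time (toy dimensions, one head; the names mirror the text and are not DeepSeek's actual code). In this sketch the query up-projection $W_{UQ}$ is folded into $W_{UK}^T$, one reasonable way to realize the absorption described above. Note how the cache only ever grows by one small latent row per token.

```python
import numpy as np

# Minimal sketch of the content path (Path 1) at inference time.
rng = np.random.default_rng(1)
d_model, d_q_latent, d_kv_latent = 16, 6, 4

W_DQ  = rng.standard_normal((d_model, d_q_latent))    # query down-projection  -> C_Q
W_UQ  = rng.standard_normal((d_q_latent, d_model))    # query up-projection
W_DKV = rng.standard_normal((d_model, d_kv_latent))   # key/value down-projection -> C_KV
W_UK  = rng.standard_normal((d_kv_latent, d_model))   # key up-projection

W_absorbed = W_UQ @ W_UK.T           # precomputed once: the two up-projections fuse

C_KV_cache = np.zeros((0, d_kv_latent))    # grows by one small latent row per token

def content_scores(x_new: np.ndarray) -> np.ndarray:
    """Take one new token, extend the latent cache, return its content-path scores."""
    global C_KV_cache
    C_KV_cache = np.vstack([C_KV_cache, x_new @ W_DKV])   # cache only the latent row
    c_q = x_new @ W_DQ                                    # down-projected query
    return (c_q @ W_absorbed) @ C_KV_cache.T              # no full keys are ever rebuilt

for step in range(3):
    scores = content_scores(rng.standard_normal((1, d_model)))
    print(step, scores.shape)    # (1, step + 1): one score per cached token
```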
Path 2: Adding Positional Information (With RoPE)
This path introduces the positional information, accepting the need for some extra computation.
- Key ($K_R$): The input is multiplied by a new weight matrix, $W_{KR}$, and RoPE is applied to the result. Importantly, DeepSeek shares this key weight matrix ($W_{KR}$) across all attention heads, which keeps the final KV cache small. This $K_R$ matrix must be cached during inference.
- Query ($Q_R$): The down-projected query $C_Q$ from Path 1 is multiplied by $W_{QR}$ (which is not shared across heads), and RoPE is applied to the result to get $Q_R$. $Q_R$ for the new token is not cached.
- Inference: When a new token arrives, $Q_R$ is computed and multiplied by the updated $K_R$ cache; the sketch after this list walks through one full decoding step with both paths.
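Putting the two paths together, here is a sketch of one decoding step (toy dimensions, a single content head; the names mirror the text and are not DeepSeek's actual code). The only things that persist between steps are the two caches: the latent $C_{KV}$ rows and the rotated $K_R$ rows.

```python
import numpy as np

# Minimal sketch of one decoding step with decoupled RoPE.
rng = np.random.default_rng(2)
d_model, d_kv_latent, d_q_latent, d_rope = 16, 4, 6, 8

W_DKV = rng.standard_normal((d_model, d_kv_latent))
W_UK  = rng.standard_normal((d_kv_latent, d_model))
W_DQ  = rng.standard_normal((d_model, d_q_latent))
W_UQ  = rng.standard_normal((d_q_latent, d_model))
W_KR  = rng.standard_normal((d_model, d_rope))      # shared across all heads
W_QR  = rng.standard_normal((d_q_latent, d_rope))   # per head in the real model

W_absorbed = W_UQ @ W_UK.T                           # Path 1 absorption still works

def rope(v, pos, base=10000.0):
    """Standard rotate-half RoPE applied to the last dimension."""
    half = v.shape[-1] // 2
    ang = pos * base ** (-np.arange(half) / half)
    x1, x2 = v[..., :half], v[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

C_KV_cache = np.zeros((0, d_kv_latent))              # Path 1 cache
K_R_cache  = np.zeros((0, d_rope))                   # Path 2 cache

def decode_step(x_new, pos):
    global C_KV_cache, K_R_cache
    # Path 1 (content): extend the latent cache, score through the absorbed matrix.
    C_KV_cache = np.vstack([C_KV_cache, x_new @ W_DKV])
    c_q = x_new @ W_DQ
    content = (c_q @ W_absorbed) @ C_KV_cache.T
    # Path 2 (position): rotate and cache the new key; the query is rotated, never cached.
    K_R_cache = np.vstack([K_R_cache, rope(x_new @ W_KR, pos)])
    q_r = rope(c_q @ W_QR, pos)
    positional = q_r @ K_R_cache.T
    # Final score: the two paths are simply added (scaling and softmax omitted here).
    return content + positional

for pos in range(3):
    print(pos, decode_step(rng.standard_normal((1, d_model)), pos).shape)
```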
4. The Result: Efficiency and Performance
By decoupling the attention mechanism, DeepSeek achieved major improvements:
- Reduced KV Cache Memory: The total cache size is significantly reduced. We only need to cache two things per token: the latent $C_{KV}$ matrix (dimension $D_L$) and the positional key $K_R$ (dimension $D_{HR}$). Because $W_{KR}$ is shared across heads, $K_R$ is cached only once per token, not once per head. The cache size is therefore proportional to $D_L + D_{HR}$.
- Massive Savings: Compared to standard Multi-Head Attention (MHA), this approach can shrink the KV cache by a huge factor, up to 57 times less memory; a quick arithmetic check follows this list.
- High Performance: Unlike other memory-saving tricks such as Multi-Query Attention or Grouped-Query Attention (MQA/GQA), which typically trade away some quality, MLA plus decoupled RoPE maintains or even improves model performance.
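As a quick sanity check on that factor, here is the per-token, per-layer arithmetic using the dimensions reported for DeepSeek-V2 (128 heads of size 128, $D_L = 512$, $D_{HR} = 64$); treat the exact numbers as assumptions for this sketch.

```python
# Back-of-the-envelope check of the savings, counted per token and per layer.
n_heads, d_head = 128, 128     # assumed DeepSeek-V2 head count and head dimension
D_L, D_HR = 512, 64            # assumed latent and positional-key dimensions

mha_per_token = 2 * n_heads * d_head   # standard MHA: full K and V for every head
mla_per_token = D_L + D_HR             # MLA: one latent row plus one shared rotated key

print(mha_per_token, mla_per_token, mha_per_token / mla_per_token)
# 32768 576 56.9...  ->  roughly the 57x figure quoted above
```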
By using this smart decoupled technique, DeepSeek ensured that the memory-saving part of MLA (Path 1) still works, while the positional information from RoPE (Path 2) is added back without breaking it. This combination offers strong capability at a fraction of the memory cost.