Build DeepSeek from scratch - Part 6: The KV Cache Memory Problem

December 15, 2025

Introduction: Why Speed Matters in LLMs

Hello students! Welcome to our next step in understanding how large AI models like DeepSeek are built.

Our goal today is to explain a key idea called the Key Value Cache, or KV Cache. This technique is fundamental to making Large Language Models (LLMs) run fast.

If you are studying computer science or AI, the KV Cache is worth learning well: it shows how engineers optimize systems to handle huge amounts of data and computation efficiently. It is also the foundation we need before we can study DeepSeek's special innovation, called Multi-Head Latent Attention.

1. How LLMs Generate Text (Inference)

When you ask an LLM (like ChatGPT) a question, the model is performing inference. Inference is the stage where a pre-trained model predicts the next piece of text.

The KV Cache is only used during this inference stage.

The model generates text one word, or token, at a time. The process works in a loop:

  1. You give the model an input sequence (a sentence or question).
  2. The model predicts the very next token.
  3. This new token is then added back to the end of the input sequence.
  4. The new, longer sequence goes through the model again to predict the next token.
  5. This loop repeats until the response is finished.

For example, if the input is "The next day is," the model predicts "bright". Then, the new input is "The next day is bright," and the model predicts the next token.
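To make the loop concrete, here is a minimal sketch in Python. The names `model`, `sample_next_token`, and `END_OF_TEXT` are hypothetical placeholders for illustration, not a real library API:

```python
# A minimal sketch of the generation loop described above.
# `model`, `sample_next_token`, and `END_OF_TEXT` are hypothetical
# placeholders, not a real API.

def generate(model, tokens, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = model(tokens)                      # whole sequence passes through the model
        next_token = sample_next_token(logits[-1])  # predict from the last position
        tokens = tokens + [next_token]              # append the new token to the input
        if next_token == END_OF_TEXT:               # stop when the model says it is done
            break
    return tokens
```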

2. The Problem: Repeating Calculations

In the standard LLM process, this loop causes a big problem: repeated computations.

Every time a new token is added, the entire input sequence must pass through the LLM architecture again.

Look at our example: when the extended sequence "The next day is bright" is processed, the model unnecessarily repeats all the work it already did for "The next day is". We perform the same calculations again and again.

This repetition leads to two main issues:

  1. Slow Speed: Without optimization, the computation time grows quadratically with the input length, because attention compares every token with every other token (the quick count after this list makes this concrete).
  2. High Cost: More computations mean more memory usage and higher running costs.
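
Here is the quick count promised above. If every step reprocesses the whole sequence, the number of attention score computations per step grows with the square of the current length (toy numbers, for illustration only):

```python
# Attention score computations when every step reprocesses the
# full sequence (toy numbers, for illustration only).
prompt_len, new_tokens = 4, 4
total = 0
for step in range(new_tokens):
    seq_len = prompt_len + step   # the sequence grows by one token each step
    total += seq_len * seq_len    # attention compares every token with every token
    print(f"step {step}: length {seq_len} -> {seq_len * seq_len} score computations")
print(f"total: {total} score computations")
```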

3. The Solution: Caching Keys and Values

Engineers asked: "Can we store the results from previous steps to avoid repeating the work?" This is where the KV Cache comes in.

Understanding the Core Need

In the attention mechanism that LLMs use, we process three main components: Query (Q), Key (K), and Value (V) matrices.

A critical technical insight is needed here: to predict the next token, we only need the context vector for the newest token. The context vectors computed for the older tokens are never used again.

To calculate the context vector for the newest token, we need three things related to attention:

  1. The Query vector (Q) for the new token.
  2. The Key matrix (K) for all tokens (old and new).
  3. The Value matrix (V) for all tokens (old and new).
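
In code, the attention calculation for the newest token looks roughly like this. This is a single-head sketch in NumPy with made-up shapes, not DeepSeek's actual implementation:

```python
import numpy as np

def context_for_new_token(q_new, K_all, V_all):
    """Context vector for the newest token only.

    q_new: (d_k,)          Query vector for the new token
    K_all: (seq_len, d_k)  Keys for all tokens, old and new
    V_all: (seq_len, d_v)  Values for all tokens, old and new
    """
    d_k = K_all.shape[-1]
    scores = K_all @ q_new / np.sqrt(d_k)    # one attention score per token
    weights = np.exp(scores - scores.max())  # softmax, numerically stabilized
    weights /= weights.sum()
    return weights @ V_all                   # weighted sum of the Values
```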

Implementing the KV Cache

Since the Key and Value matrices (K and V) for the old tokens stay the same, we can simply store them in memory. Storing previously computed values is called caching.

This is why it is called the Key Value Cache. We only cache K and V; we never need to cache the Queries (Q), because each step only ever uses the Query of the single newest token.

When a new token arrives, the LLM only performs three new computations:

  1. Calculate the new Query vector (Q) for the single new token.
  2. Calculate the new Key vector (K) for the single new token.
  3. Calculate the new Value vector (V) for the single new token.

The system then appends (adds) the new K and V vectors to the stored KV Cache. It uses the new Query vector (Q) with the combined (cached + new) K and V matrices to get the context vector for only the last token. This single context vector is all we need to predict the next token.
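
Putting the whole step together, here is a hedged single-head sketch in NumPy. The projection matrices `W_q`, `W_k`, and `W_v` stand in for the model's learned weights; a real model runs this per layer and per attention head:

```python
import numpy as np

class KVCache:
    """Stores the Key and Value vectors computed for past tokens."""
    def __init__(self):
        self.K = None  # shape grows to (num_cached_tokens, d_k)
        self.V = None  # shape grows to (num_cached_tokens, d_v)

    def append(self, k_new, v_new):
        k_new, v_new = k_new[None, :], v_new[None, :]
        self.K = k_new if self.K is None else np.vstack([self.K, k_new])
        self.V = v_new if self.V is None else np.vstack([self.V, v_new])

def cached_attention_step(x_new, W_q, W_k, W_v, cache):
    """One decoding step: only the new token's Q, K, V are computed."""
    q_new = x_new @ W_q          # 1. Query for the single new token
    k_new = x_new @ W_k          # 2. Key for the single new token
    v_new = x_new @ W_v          # 3. Value for the single new token
    cache.append(k_new, v_new)   # extend the cache with the new K and V
    scores = cache.K @ q_new / np.sqrt(cache.K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()     # softmax over cached + new tokens
    return weights @ cache.V     # context vector for the last token only
```

Each call does only the work for one token; the cache simply grows by one row of K and one row of V per step.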

4. The Impact: Advantages and the Dark Side

The Good Side: Speed

The KV Cache dramatically speeds up inference time.

By avoiding repeated calculations, the computation time now scales linearly with the number of input tokens. Linear scaling means the cost grows in direct proportion to the sequence length, not with its square, which is much better than quadratic scaling.

In practice, using a KV Cache can make inference two to three times faster than running without one.
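
Repeating the earlier toy count, now with a cache, shows the linear trend: each step computes one score per token instead of one per pair of tokens (illustrative numbers only):

```python
# Score computations with and without a KV Cache (toy numbers).
prompt_len, new_tokens = 4, 4
without_cache = sum((prompt_len + s) ** 2 for s in range(new_tokens))  # every pair, every step
with_cache = sum(prompt_len + s for s in range(new_tokens))            # one Query row per step
print(f"without cache: {without_cache} score computations")            # 126
print(f"with cache:    {with_cache} score computations")               # 22
```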

The Bad Side: Memory Cost

The Key Value Cache has a "dark side": it uses a lot of memory. Caching means storing data, and storing data takes up space.

The total size of the KV Cache depends on several factors:

  1. The context length (how many tokens must be cached).
  2. The number of transformer layers (each layer keeps its own K and V).
  3. The number of attention heads and the dimension of each head.
  4. The batch size (every sequence being served needs its own cache).
  5. The numeric precision (how many bytes each stored value takes).

When the context length is very large, the KV Cache size becomes massive. For a large DeepSeek model, the KV Cache might need up to 400GB of storage.
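
As a hedged back-of-the-envelope, the cache size is just a product of those factors. The configuration below is made up for illustration and is not DeepSeek's real setup:

```python
# Rough KV Cache size estimate (illustrative configuration,
# not any specific model's real numbers).
num_layers   = 60
num_heads    = 96
head_dim     = 128
context_len  = 100_000
batch_size   = 1
bytes_per_el = 2   # e.g. 16-bit floating point

# The factor of 2 is one cache for Keys plus one for Values.
cache_bytes = (2 * num_layers * num_heads * head_dim
               * context_len * batch_size * bytes_per_el)
print(f"approx. {cache_bytes / 1e9:.0f} GB")  # roughly 295 GB for these numbers
```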

Storing such a huge cache is expensive, and it occupies memory that other computations need, which can slow them down. This is why LLM providers charge more for models that support a larger context length.

Summary and Next Steps

The KV Cache is a smart engineering solution that saves time by storing past computations (Keys and Values). It changes computation complexity from quadratic (too slow) to linear (much faster).

However, the major drawback is the high memory usage.

To deal with this expensive memory problem, DeepSeek and other companies invented new techniques. DeepSeek’s innovation, Multi-Head Latent Attention, was created specifically to deal with the "dark side" of the KV Cache.

In the next part, we will start learning about the innovations designed to reduce the KV Cache memory footprint!