1. Introduction: Why DeepSeek Needs to Know Position
Welcome to our series on building the DeepSeek architecture! Our main goal is to understand the powerful components of DeepSeek V2 and V3 in detail.
Today, we start with Positional Encodings (or Positional Embeddings). This topic is very important because DeepSeek V2 and V3 use an advanced method called Rotary Position Embedding (RoPE), integrated with their Multi-head Latent Attention mechanism. To truly understand RoPE, we must first learn the foundational ideas of position encoding.
If you are learning AI or systems engineering, understanding positional encoding is crucial. It shows you how language models know the order of words, which is key to understanding context.
2. The Problem: Ignoring Word Position
In a sentence, the position of a word is very important for its meaning.
Consider this simple sentence: "The dog chased another dog". It contains two instances of the word "dog".
2.1 The Issue with Token Embeddings Alone
In a Transformer model, the input text is first turned into token embeddings. Token embeddings capture the semantic meaning of the word—what the word means.
If we only use token embeddings and ignore position, the vectors for the first "dog" and the second "dog" are exactly the same, because they are the same word.
When these two identical vectors go into the Transformer block, the block performs the same operations on both. This means the final context vector (the output of the attention block) for the first "dog" and the second "dog" will also be exactly the same.
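To make this concrete, here is a minimal sketch in NumPy. The vocabulary, embedding dimension, and random values are invented purely for illustration; a real model learns these tables during training.

```python
import numpy as np

# Toy vocabulary and embedding table; every name and value here is made up.
rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "chased": 2, "another": 3}
embedding_table = rng.normal(scale=0.02, size=(len(vocab), 8))

tokens = ["the", "dog", "chased", "another", "dog"]
token_embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])

# Both occurrences of "dog" look up the same row of the table,
# so their vectors are bit-for-bit identical.
print(np.array_equal(token_embeddings[1], token_embeddings[4]))  # True
```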
2.2 Why Identical Outputs are Bad
We do not want the context vectors to be identical. In the sentence, the first dog is the one chasing, and the second dog is the one being chased. The model must capture this difference in context.
Without positional information, the model cannot tell the difference between two identical words in different spots, even if they refer to different things. This is why we need to add positional information.
Solution: We add a positional embedding vector to the token embedding vector before passing it to the Transformer. This addition creates a unique input embedding for every token, even if the token itself is repeated.
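A minimal sketch of this addition, again with made-up dimensions and a hypothetical positional table:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8

token_emb = rng.normal(scale=0.02, size=(seq_len, d_model))
token_emb[4] = token_emb[1]  # positions 1 and 4 carry the same word ("dog")

# One (hypothetical) positional vector per position in the sequence.
pos_emb = rng.normal(scale=0.02, size=(seq_len, d_model))

x = token_emb + pos_emb  # input embeddings fed to the Transformer block
print(np.array_equal(x[1], x[4]))  # False: same token, distinct inputs
```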
3. Attempt 1: Integer Positional Encoding
To encode position, the simplest idea is to use the word's index, or position number, in the sentence.
3.1 How Integer Encoding Works
- Find the position number (e.g., position 200).
- Create a vector where this number (200) is repeated across all dimensions of the embedding (e.g., 200, 200, 200, ...).
- Add this large-valued positional vector to the small-valued token embedding vector.
For our example, the first dog might be at position 200 and the second at position 203. This successfully makes the input embedding vectors for the two dogs different.
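A sketch of this scheme; the function name and dimensions are illustrative, not from any real library:

```python
import numpy as np

def integer_positional_encoding(position: int, d_model: int) -> np.ndarray:
    # Repeat the raw position number across every embedding dimension.
    return np.full(d_model, float(position))

print(integer_positional_encoding(200, 8))  # [200. 200. ... 200.]
print(integer_positional_encoding(203, 8))  # [203. 203. ... 203.]
```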
3.2 The Major Problem: Value Magnitude
This method has a serious fault: the size of the numbers (magnitude).
- Token Embedding Values are Small: Token embedding values are usually small, often clustered around zero. They hold the important semantic meaning of the word.
- Position Values are Large: If the context is long (e.g., 1024 words), the position values can be very large (up to 1024).
When you add a very small value (token embedding) to a very large value (positional encoding), the small value's effect is completely lost: the huge positional value drowns out the semantic information captured by the token embedding. We need positional encoding values that stay small and constrained.
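A small numeric sketch of the damage, using toy embeddings and cosine similarity (all values invented):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
word_a = rng.normal(scale=0.02, size=8)  # toy embedding of one word
word_b = rng.normal(scale=0.02, size=8)  # toy embedding of an unrelated word

print(cosine(word_a, word_b))                  # roughly 0: clearly different words
print(cosine(word_a + 200.0, word_b + 203.0))  # ~1.0: position drowns the meaning
```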
4. Attempt 2: Binary Positional Encoding
To constrain the values, we move from standard integers to binary numbers.
4.1 How Binary Encoding Solves the Magnitude Problem
The main idea is to represent the position number (e.g., 200) using its binary form.
- Constrained Values: When we use binary, the values in the positional vector are only 0 or 1.
- Similar Magnitude: Because the values are now 0s and 1s, the size of the positional vector is similar to the size of the token embedding vector.
- Preserving Semantics: This similar size ensures that the position information does not dilute the word's semantic information captured by the token embedding.
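Here is a minimal sketch of such an encoder; the function name and the 8-bit width are illustrative choices:

```python
import numpy as np

def binary_positional_encoding(position: int, n_bits: int = 8) -> np.ndarray:
    # Index 1 (the least significant bit) comes first, Index n_bits last.
    bits = [(position >> i) & 1 for i in range(n_bits)]
    return np.array(bits, dtype=float)

print(binary_positional_encoding(200))  # 200 = 0b11001000 -> [0. 0. 0. 1. 0. 0. 1. 1.]
```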
We now have two variables for understanding position: the position in the sentence (e.g., 64, 65, 66) and the index within the binary vector (Index 1, Index 2, etc.).
4.2 Understanding Index Oscillation
When we look at the binary representation, we notice how the bits (indexes) change as the position number increases.
- Least Significant Bit (LSB): This is Index 1, the lowest index. This index oscillates (changes) the fastest. It changes between 0 and 1 with every new position. Lower indexes are good at capturing fast, fine-grained changes between nearby positions.
- Most Significant Bit (MSB): This is Index 8, the highest index. This index oscillates very slowly; in an 8-bit representation it flips only once every 128 positions. Higher indexes capture slow, broad-level changes across positions.
This structure is a major step because it allows different parts of the positional vector to capture different types of relationships (nearby changes vs. far changes).
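A quick way to see the two oscillation rates side by side (plain Python; Index 1 is printed first):

```python
# Print the 8-bit patterns for a few consecutive positions.
for pos in range(64, 72):
    bits = [(pos >> i) & 1 for i in range(8)]
    print(pos, bits)
# Index 1 alternates 0,1,0,1,... on every step, while Index 7 stays at 1 for
# this whole window and Index 8 would first flip at position 128.
```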
4.3 The New Problem: Discontinuity
Even though binary encoding solves the magnitude issue, it creates a new problem related to training the model.
Binary values are discrete—they only jump between 0 and 1. This means the graph showing how the index values change is full of sudden jumps, or discontinuities.
These jumps make the optimization process (training the LLM with backpropagation) difficult, because a function full of discontinuities gives no useful gradient at the jumps.
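You can see the discontinuity directly by listing the lowest index's values and their step-to-step differences:

```python
# The least significant bit as a function of position: the value jumps
# between 0 and 1 with no smooth transition in between.
lsb = [pos & 1 for pos in range(8)]
print(lsb)                                    # [0, 1, 0, 1, 0, 1, 0, 1]
print([b - a for a, b in zip(lsb, lsb[1:])])  # [1, -1, 1, -1, ...]: abrupt jumps
```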
The Next Step: We need a way to keep the values constrained (like binary) but make the changes smooth and continuous. If the values change smoothly, the encoding is differentiable, which makes backpropagation much easier for the model. This is the intuition that leads us to the next type of positional encoding: Sinusoidal Positional Encoding. We will cover it in the next part of this series.
We have seen that to build DeepSeek, we must learn how to correctly encode position without corrupting semantic meaning. By going through integer and binary encodings, we understand the technical demands that led to the development of smoother, more advanced methods like Sinusoidal and Rotary Positional Encodings.