Build DeepSeek from scratch - Part 11: Sinusoidal Positional Encoding

December 17, 2025

Introduction: Why Position Matters

Welcome back, students! We are continuing our work to understand the core parts of DeepSeek. DeepSeek uses Rotary Positional Encodings (RoPE) together with Multi-Head Latent Attention. To understand RoPE, we must first master the building blocks.

The goal of this part is to deeply understand Sinusoidal Positional Encodings (SPE). SPE matters because it fixes the key weaknesses of the simpler encoding schemes we covered earlier, and it introduces the rotation idea that RoPE builds on. Learning it gives you the foundation for the more advanced positional machinery in modern LLMs.

1. Why We Needed Sinusoidal Encoding

In the past, we looked at two simpler ways to tell the AI where a token (word) is located:

  1. Integer Positional Encoding: We used large numbers (like 200 or 500) to mark the position. The problem was that these large numbers polluted the token embedding. They made the original meaning of the word unclear. We want to keep the semantic information (the word's meaning) safe.
  2. Binary Positional Encoding: We used only 0s and 1s, which fixed the problem of large numbers. However, the change between positions was not smooth; it was full of sudden "discrete jumps". These jumps made the optimization routine during pre-training very difficult for the Large Language Model (LLM).

We needed a solution that kept the values small and made the changes smooth. This led to Sinusoidal Positional Encodings.
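Before moving on, here is a minimal sketch of that jump problem (plain NumPy, with an illustrative helper name, not code from an earlier part): encoding positions as raw binary digits shows that moving from position 7 to position 8 flips every bit at once.

```python
import numpy as np

def binary_positional_encoding(pos: int, dim: int) -> np.ndarray:
    """Encode a position as its binary digits, least-significant bit first."""
    return np.array([(pos >> i) & 1 for i in range(dim)], dtype=np.float32)

# Consecutive positions can differ in many dimensions at once: a discrete jump.
print(binary_positional_encoding(7, 4))   # [1. 1. 1. 0.]
print(binary_positional_encoding(8, 4))   # [0. 0. 0. 1.]  -> all four bits flip
```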

2. The Idea: Continuous and Smooth Positions

Sinusoidal Positional Encodings were made to fix the discontinuity problem. Instead of using only 0 or 1, SPE uses values on a continuous spectrum (a smooth range).

These values come from the sine (sin) and cosine (cos) functions, which naturally create smooth, wave-like curves.

Key Advantage: Because the positional information is continuous and smooth, it is also differentiable. This leads to a much more stable and effective LLM optimization routine during training.

The values of the positional vector lie on a continuous scale between -1 and 1, since sine and cosine are bounded in that range.

3. How the Sinusoidal Formula Works

The value of the positional encoding depends on two main factors:

  1. The Position (pos): Where the token sits in the sequence (e.g., 0 to 1023 for a context window of 1024 tokens).
  2. The Index (i): Which dimension of the positional vector we are looking at (e.g., 0 to 767 for a model dimension of 768).

The positional encoding (PE) value is calculated using this structure:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Even dimensions (2i) use sine and odd dimensions (2i + 1) use cosine, and both share the same denominator. This structure makes the values oscillate (move up and down) smoothly as the position changes.
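To make the formula concrete, here is a minimal NumPy sketch (the function name and the example sizes are illustrative choices, not DeepSeek's actual configuration):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]             # (seq_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]                # (1, d_model // 2)
    angles = positions / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=1024, d_model=768)
print(pe.shape)              # (1024, 768)
print(pe.min(), pe.max())    # every value stays inside [-1, 1]
```

Each row of this matrix is the positional vector that gets combined with the token embedding at that position.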

3.1. Understanding Oscillation Frequency

This formula preserves a key intuition we learned from binary encodings: just as the low-order bits of a binary counter flip fastest and the high-order bits flip slowest, the low dimensions of the sinusoidal vector oscillate fastest and the high dimensions oscillate slowest.

The term 10,000 in the denominator controls this behavior. Because the index i sits in the exponent of the denominator, a small index gives a high frequency (fast oscillation) and a large index gives a low frequency (slow oscillation). The combination of many different frequencies gives every position a unique, smoothly varying fingerprint.
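A quick sketch (reusing the illustrative sizes above) makes this visible: the wavelength of dimension pair i is 2π · 10000^(2i/d_model), so low indices complete a full cycle within a handful of positions while high indices barely move across the entire context window.

```python
import numpy as np

d_model = 768
for i in [0, 1, 10, 100, 383]:                        # dimension-pair indices
    freq = 1.0 / np.power(10000.0, 2 * i / d_model)   # angular frequency
    wavelength = 2 * np.pi / freq                     # positions per full cycle
    print(f"i={i:3d}  frequency={freq:.2e}  wavelength≈{wavelength:,.0f} positions")
```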

4. The Power of Sine and Cosine: Rotation

The most powerful reason for using both sin and cos together is that they allow the positional vectors to be related by rotation.

Why Rotation?

We need a clear mathematical connection between positions. If a transformer knows the positional vector for position P, it should be able to reach the vector for a shifted position P + K (where K is the shift) through a simple, position-independent transformation.

When we use sin and cos together, the values at a shifted position follow directly from the angle-addition identities:

$$\begin{aligned}
\sin(\omega(p+k)) &= \sin(\omega p)\cos(\omega k) + \cos(\omega p)\sin(\omega k) \\
\cos(\omega(p+k)) &= \cos(\omega p)\cos(\omega k) - \sin(\omega p)\sin(\omega k)
\end{aligned}$$

Here ω = 1/10000^(2i/d_model) is the angular frequency of one dimension pair. In matrix form, the (sin, cos) pair at position p + k is simply the pair at position p rotated by the angle ωk. Because the positional encodings at different positions are just rotations of each other, the transformer can easily learn the relative relationship between tokens. This is an essential property that we inject into the system.
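The sketch below (plain NumPy, with illustrative values for ω, p, and k) checks this numerically: multiplying the (sin, cos) pair at position p by the 2×2 rotation matrix for the angle ωk reproduces the pair at position p + k exactly.

```python
import numpy as np

omega = 1.0 / 10000 ** (2 * 4 / 768)    # angular frequency of one dimension pair (i = 4)
p, k = 10, 7                            # original position and shift

def pe_pair(pos: float) -> np.ndarray:
    """(sin, cos) values of one dimension pair at a given position."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

theta = omega * k
rotation = np.array([[ np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

print(rotation @ pe_pair(p))   # pair at position p, rotated by omega * k
print(pe_pair(p + k))          # pair computed directly at p + k -> identical
```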

5. Moving to Rotary Positional Encoding (RoPE)

Although Sinusoidal Positional Encodings are powerful and introduced the concept of rotation, they have one major drawback that led to modern systems like DeepSeek adopting RoPE:

The Pollution Problem: SPE vectors are directly added to the token embeddings. Even though the positional values are small, this addition pollutes the semantic information carried by the token. We want the token's meaning to remain pure.

This required a change in strategy:

  1. Shift the Operation: Instead of adding position data to the token embeddings, we should apply it directly to the Query (Q) and Key (K) vectors inside the attention mechanism.
  2. Use Rotation for Magnitude Preservation: Instead of adding a vector (which changes the vector's size or magnitude), we should rotate the Q and K vectors. Rotating a vector changes its direction but keeps its magnitude the same. The amount of rotation depends on the position.

These two ideas—applying position encoding at the Q/K level and using magnitude-preserving rotations—are the foundation of Rotary Positional Encoding (RoPE).
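As a preview (a minimal sketch of the idea, not DeepSeek's actual implementation), a single 2-dimensional query pair already shows why rotation is attractive: rotating it by a position-dependent angle changes its direction but leaves its magnitude untouched, whereas adding a positional vector, as SPE does, changes the magnitude.

```python
import numpy as np

def rotate_by_position(vec: np.ndarray, pos: int, omega: float) -> np.ndarray:
    """Rotate a 2-D query/key pair by a position-dependent angle (RoPE-style)."""
    theta = omega * pos
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ vec

q = np.array([0.8, -0.3])                            # a toy 2-D query pair
q_rotated = rotate_by_position(q, pos=5, omega=0.1)

print(np.linalg.norm(q), np.linalg.norm(q_rotated))  # identical norms: rotation preserves magnitude
print(np.linalg.norm(q + np.array([0.5, 0.5])))      # additive encoding changes the norm
```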

We now have the necessary tools to understand RoPE, the crucial component that DeepSeek and other modern LLMs use to prevent contamination of semantic data.