Introduction: Why DeepSeek Is Different
Hello students! A Mixture of Experts (MoE) module replaces the standard Feed Forward Network (FFN) in a Transformer block.
DeepSeek did not invent the MoE architecture, but they found ways to make it much better and more efficient. They identified key problems in traditional MoE systems and created smart solutions.
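Before diving into the innovations, here is a minimal sketch (my own simplified PyTorch, with an assumed `moe_layer` callable, not DeepSeek's code) of where the MoE module sits: it occupies the slot in the Transformer block that the FFN normally fills.

```python
import torch.nn as nn

class TransformerBlockWithMoE(nn.Module):
    """Sketch: the MoE module occupies the slot the FFN normally fills."""
    def __init__(self, d_model, n_heads, moe_layer):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = moe_layer  # replaces nn.Sequential(Linear, GELU, Linear)

    def forward(self, x):  # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]      # self-attention sub-layer
        x = x + self.moe(self.norm2(x))    # the FFN slot now runs the MoE
        return x
```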
In this post, we will study the three main technical innovations that make DeepSeek MoE so powerful:
- Auxiliary Loss-Free Load Balancing.
- Shared Experts.
- Fine-Grained Expert Segmentation.
Understanding these changes is key to building high-performance AI models, because together they deliver better model quality and better expert load balance.
1. Innovation 1: Auxiliary Loss-Free Load Balancing
When using traditional MoE models, we must ensure that all experts are used roughly equally. This is called load balancing. If one expert handles most of the tokens while the others sit idle, the model wastes capacity and compute.
The Problem with the Old Way
The old way to balance the load was by adding an extra term called the auxiliary loss to the main training loss.
The main training loss helps the model predict the next token. The auxiliary loss helps keep the experts balanced.
The problem is that these two losses interfere with each other.
- If you make the auxiliary loss term too small, experts become unbalanced.
- If you make the auxiliary loss term too big, the model performs poorly at predicting the next token (it degrades training quality).
This trade-off made it hard to balance experts well without reducing the quality of the model.
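For concreteness, here is a rough sketch of what a conventional auxiliary balancing loss looks like. It follows the common Switch-Transformer-style formulation with a weight `alpha`; it is an illustration of the old approach, not DeepSeek's code.

```python
import torch
import torch.nn.functional as F

def auxiliary_balance_loss(router_logits, expert_idx, n_experts, alpha=0.01):
    """Conventional load-balancing auxiliary loss (Switch-Transformer style).

    The model is trained on lm_loss + aux_loss, so alpha trades balance
    against quality: too small and experts collapse onto a few favorites,
    too large and the language-modelling objective suffers.
    """
    probs = F.softmax(router_logits, dim=-1)            # (tokens, n_experts)
    # f_i: fraction of tokens actually dispatched to expert i
    f = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
    f = f / expert_idx.numel()
    # p_i: average router probability assigned to expert i
    p = probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * p)
```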
DeepSeek’s Solution: The Dynamic Bias
DeepSeek solved this problem by removing the auxiliary loss term entirely. They achieve load balancing without adding any extra gradient signal during training.
They use a technique based on a dynamic bias term:
- Check the Load: The system first computes the average token load (the number of tokens each expert should process if the load were perfectly even).
- Find the Violation: It compares this average load to the actual number of tokens sent to each expert.
  - If an expert receives fewer tokens than the average, it is underloaded (positive load violation).
  - If an expert receives more tokens than the average, it is overloaded (negative load violation).
- Adjust the Bias: A bias term for every expert is dynamically updated based on its load violation.
  - For underloaded experts, the bias term is increased.
  - For overloaded experts, the bias term is reduced.
- Route the Tokens: The updated bias is then added to the expert selection scores (the logits) before the next tokens are routed.
This process ensures that underloaded experts have a higher probability of being chosen next time and overloaded experts a lower one. Balance is maintained while the main training loss stays free of any extra balancing term.
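A minimal sketch of this update rule, under my own assumptions about variable names and a fixed update step `gamma` (not DeepSeek's actual code), looks like this:

```python
import torch

def update_routing_bias(bias, tokens_per_expert, gamma=0.001):
    """Auxiliary-loss-free balancing: nudge each expert's bias toward the mean load.

    No gradients are involved; the bias is adjusted with a simple rule after
    each batch, based on the observed load violation.
    """
    avg_load = tokens_per_expert.float().mean()        # target tokens per expert
    violation = avg_load - tokens_per_expert.float()   # >0: underloaded, <0: overloaded
    return bias + gamma * torch.sign(violation)        # raise bias for underloaded experts

def select_experts(scores, bias, top_k):
    """The bias only influences which experts are selected, not the mixing weights."""
    _, expert_idx = (scores + bias).topk(top_k, dim=-1)
    return expert_idx
```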
2. Innovation 2: Shared Experts
Traditional MoE models often suffer from knowledge redundancy.
The Redundancy Problem
Knowledge redundancy means that multiple specialized experts might all learn the same common information. For example, three different experts might all learn the same basic grammar rules. This hinders true specialization and wastes computational power.
DeepSeek’s Solution: Two Groups of Experts
DeepSeek solved redundancy by dividing all experts into two groups:
- Routed Experts: These are the standard experts that are activated sparsely (only a few are selected for each token using Top-K routing). They focus on specialized knowledge.
- Shared Experts: This is a small group of experts that is always activated; they process every single input token.
The Shared Experts handle all the common tasks and general knowledge. Because the common knowledge is centralized here, the Routed Experts do not need to learn it again.
This change allows the Routed Experts to focus only on highly specialized tasks, making the overall architecture much more efficient. The final MoE output is simply the combination (summation) of the outputs from the Shared Experts and the Routed Experts. Shared Experts solve the knowledge redundancy problem.
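A simplified sketch of this two-group design (my own PyTorch, not DeepSeek's implementation) shows the summation: the shared experts run on every token, and the routed experts are mixed in through Top-K routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPlusRoutedMoE(nn.Module):
    """Sketch: output = shared-expert outputs + weighted sum of Top-K routed experts."""
    def __init__(self, d_model, d_hidden, n_shared, n_routed, top_k):
        super().__init__()
        def make_ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))  # always active
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))  # sparsely active
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        out = sum(expert(x) for expert in self.shared)    # common-knowledge path
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize chosen gates
        for slot in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```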
3. Innovation 3: Fine-Grained Expert Segmentation
The second major knowledge problem in traditional MoE is knowledge hybridity.
The Hybridity Problem
If a model only has a limited number of experts (e.g., 8 to 16), each expert is forced to acquire many diverse types of knowledge. This "hybridity" means the experts are not highly specialized. They try to do too many things at once.
DeepSeek’s Solution: More, Smaller Experts
DeepSeek introduced Fine-Grained Expert Segmentation. The simple idea is to use a large number of small experts instead of a small number of large ones.
DeepSeek does this by taking each large expert (FFN) and splitting it into many smaller experts.
Important Technical Detail: To ensure that the total model size and computational cost do not increase, DeepSeek reduces the hidden dimension of each small expert. This means the total number of parameters remains the same.
By having many more experts, DeepSeek allows each expert to specialize in a very specific type of knowledge. This solves the knowledge hybridity problem.
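A quick back-of-the-envelope check, with made-up sizes and a simple two-matrix FFN rather than DeepSeek's real configuration, shows why splitting keeps the parameter count flat: shrinking each expert's hidden dimension by the same factor used to multiply the expert count leaves the total unchanged.

```python
# Illustrative sizes only, not DeepSeek's actual config.
d_model = 4096
d_hidden = 11008        # hidden width of one "large" expert
n_experts = 8           # coarse-grained experts
m = 4                   # split factor for fine-grained segmentation

def params_per_expert(d_h):
    return 2 * d_model * d_h   # up-projection + down-projection of one expert FFN

coarse_total = n_experts * params_per_expert(d_hidden)
fine_total = (n_experts * m) * params_per_expert(d_hidden // m)   # 32 smaller experts

print(coarse_total == fine_total)  # True: same parameters, 4x more experts to specialize
# To keep compute per token constant as well, the router also activates
# proportionally more (m times as many) of these smaller experts per token.
```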
This method lets DeepSeek achieve comparable performance to much larger models, but with a significantly smaller number of activated parameters, making training much cheaper.
Conclusion: Building an Efficient System
By implementing these three innovations, DeepSeek built an MoE model that is both highly specialized and highly efficient:
- Auxiliary Loss-Free Load Balancing: Ensures all experts work equally hard without an extra loss term that would degrade training quality.
- Shared Experts: Centralizes common knowledge to prevent wasting resources on redundancy.
- Fine-Grained Segmentation: Creates many specialized experts to handle complex, diverse knowledge (hybridity).
You now have a complete technical understanding of how DeepSeek optimized the Mixture of Experts architecture. This knowledge is necessary for the final steps of building and training your own efficient large language model.