Building DeepSeek from Scratch: Balancing the Experts (Part 16)
Welcome back, future AI engineers!
Today, we continue our deep dive into the technology behind DeepSeek. When we build very large language models (LLMs) using the Mixture of Experts (MoE) design, we aim for speed and efficiency. MoE achieves this by having many small neural networks (experts), but only using a few for each piece of data (token).
The Goal: Our main challenge is simple: How do we ensure all experts share the work equally? If some experts do too much work while others do nothing, training becomes slow and the model's quality suffers. We want a balanced model.
This lesson will teach you three major methods used in MoE systems to solve this problem: Auxiliary Loss, Load Balancing, and the Capacity Factor. Understanding these is key to mastering DeepSeek's innovations.
1. Auxiliary Loss: Making Experts Equally Important
In MoE, a routing mechanism selects a small group of experts for every token. This mechanism creates an "expert selector weight matrix" that shows which expert gets which token and how much weight (importance) that expert gets.
1.1 What is Expert Importance?
We must first find the "Expert Importance" score for each expert.
- We look at the probabilities of all tokens being routed to one specific expert (this is one column in the expert selector weight matrix).
- We add all these probabilities together.
- The total sum is the Expert Importance.
If Expert 3 has a total importance of 1.6 and Expert 2 has 1.0, Expert 3 is doing much more work than Expert 2.
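The column-sum described above can be sketched in a few lines of NumPy. The matrix below is illustrative (not from the lesson), but its values are chosen so that the columns reproduce the importance scores mentioned here: 1.6 for Expert 3 and 1.0 for Expert 2.

```python
import numpy as np

# Hypothetical router output: one row per token, one column per expert.
# Each row holds that token's routing probabilities (each row sums to 1).
router_probs = np.array([
    [0.1, 0.2, 0.6, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.3, 0.2, 0.3, 0.2],
    [0.1, 0.3, 0.3, 0.3],
])

# Expert Importance: sum each column (total probability mass per expert).
importance = router_probs.sum(axis=0)
print(importance)  # [0.7 1.  1.6 0.7]
```

Expert 3 (the third column) accumulates 1.6 while Expert 2 accumulates 1.0, matching the imbalance described above.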
1.2 Using Auxiliary Loss for Balance
We want all expert importance values to be roughly the same. If one expert is neglected (importance is zero), it sits idle and learns nothing.
To fix this, we add a special term called Auxiliary Loss to the main LLM training loss.
What the Auxiliary Loss does:
- It looks at how much the importance scores of the experts vary (change).
- It penalizes (punishes) the model if the variation is high.
- This penalty forces the router to choose experts more uniformly.
We use a mathematical tool called the Coefficient of Variation (CV) to track this change. CV is calculated by dividing the standard deviation by the mean of the importance scores.
The Auxiliary Loss formula is:
$$ \text{Auxiliary Loss} = \lambda \times (\text{Coefficient of Variation})^2 $$
Where $\lambda$ (lambda) is a scaling factor (a hyperparameter you set).
When the CV is high, the loss is high, meaning the experts are unbalanced. When we minimize this loss, we make sure all experts have similar importance.
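The CV-squared formula translates directly into code. This is a minimal NumPy sketch; the value of $\lambda$ here (0.01) is just a placeholder hyperparameter, and the importance vectors are made-up examples.

```python
import numpy as np

def auxiliary_loss(importance, lam=0.01):
    """lambda * (std / mean)^2 of the expert importance scores."""
    cv = importance.std() / importance.mean()
    return lam * cv ** 2

balanced   = np.array([1.0, 1.0, 1.0, 1.0])
unbalanced = np.array([0.1, 0.1, 0.2, 3.6])

print(auxiliary_loss(balanced))    # 0.0 — equal importance, no penalty
print(auxiliary_loss(unbalanced))  # a much larger penalty
```

When every expert has the same importance, the standard deviation is zero and the penalty vanishes; the more skewed the scores, the larger the loss the router must pay.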
2. Load Balancing: Making Token Routing Uniform
Just making the importance scores equal is not enough. An expert might have a high importance score but receive very few tokens, while another expert receives many tokens but with low probabilities. We need to balance the actual load—the number of tokens each expert receives.
We use Load Balancing Loss to ensure the load is balanced. This loss uses two key quantities for every expert $i$:
2.1 Two Key Quantities
- $\pi_i$ (Pi): Probability of Expert Selection.
  - This is the average probability that the router assigns to expert $i$.
  - It is calculated by dividing the Expert Importance score we discussed before by the total number of tokens.
  - Example: Expert Importance (1.4) divided by Total Tokens (4) gives $\pi_i = 0.35$.
- $F_i$ (F): Fraction of Tokens Dispatched.
  - This is the actual fraction of tokens sent to expert $i$.
  - If 2 out of 4 tokens go to Expert 1, $F_1 = 2/4 = 0.5$.
2.2 Minimizing Imbalance with Loss
The Load Balancing Loss minimizes the product of these two factors: $\sum (F_i \times \pi_i)$.
Why this works:
- When the system is highly unbalanced (e.g., Expert 1 receives every token, so $F_1 = 1$ and $\pi_1$ is close to 1), the sum is large.
- When the system is balanced (tokens are split evenly, so every $F_i$ and $\pi_i$ is small and roughly equal), the sum, and therefore the loss, is small.
Minimizing this loss pushes both the importance ($\pi_i$) and the actual token routing ($F_i$) to be uniform across all experts. This alignment ensures that experts with high importance also handle more tokens.
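Both quantities and their product-sum can be sketched with NumPy. This is an illustrative implementation of $\sum_i F_i \times \pi_i$ as defined above (some papers also multiply by the number of experts, which is omitted here); the example routing matrices are made up.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, num_experts):
    """Sum over experts of F_i * pi_i."""
    num_tokens = router_probs.shape[0]
    # pi_i: Expert Importance (column sum) divided by the number of tokens.
    pi = router_probs.sum(axis=0) / num_tokens
    # F_i: fraction of tokens actually dispatched to each expert.
    counts = np.bincount(expert_assignment, minlength=num_experts)
    f = counts / num_tokens
    return float((f * pi).sum())

uniform_probs = np.full((4, 4), 0.25)                 # balanced routing
even_assign   = np.array([0, 1, 2, 3])                # one token per expert
skewed_probs  = np.array([[1.0, 0.0, 0.0, 0.0]] * 4)  # everything to Expert 1
skewed_assign = np.array([0, 0, 0, 0])

print(load_balancing_loss(uniform_probs, even_assign, 4))   # 0.25 — balanced
print(load_balancing_loss(skewed_probs, skewed_assign, 4))  # 1.0 — fully unbalanced
```

In the balanced case each $F_i = \pi_i = 0.25$, giving $4 \times 0.0625 = 0.25$; in the skewed case $F_1 = \pi_1 = 1$, giving a loss four times larger.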
3. Expert Capacity: Setting a Maximum Limit
Load balancing and auxiliary loss help, but we need one more safeguard. We need to prevent one expert from receiving too many tokens—from hogging the limelight.
The Expert Capacity acts as a hard limit: it defines the maximum number of tokens a single expert can process in one batch.
The calculation for the capacity is:
$$ \text{Expert Capacity} = \frac{\text{Tokens per Batch}}{\text{Number of Experts}} \times \text{Capacity Factor} $$
3.1 Understanding the Capacity Factor
The Capacity Factor is a multiplier:
- If the Capacity Factor is 1.0, each expert's capacity is exactly an equal share of the tokens.
- If the Capacity Factor is greater than 1.0 (like 1.25 or 1.5), it gives experts a little extra room to handle more tokens than their fair share. This is common because MoE routing is not always perfectly equal.
- If the Capacity Factor is less than 1.0, some tokens must be dropped because there is not enough capacity for all of them.
By setting a capacity, we ensure that no single expert can receive, for example, four times the normal number of tokens in one batch, forcing better distribution. This guardrail increases stability during training.
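The capacity formula is a one-liner; this sketch rounds up to a whole number of tokens (a common convention, though implementations differ) and uses made-up batch sizes.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum tokens a single expert may process in one batch."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

print(expert_capacity(1024, 8, 1.0))   # 128 — exactly an even share
print(expert_capacity(1024, 8, 1.25))  # 160 — 25% headroom for uneven routing
print(expert_capacity(1024, 8, 0.5))   # 64 — excess tokens must be dropped
```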
Summary
You have now learned the three crucial steps for balancing experts in MoE training:
- Auxiliary Loss: Ensures that the importance scores of experts are similar.
- Load Balancing Loss: Ensures that the actual fraction of tokens routed to experts is uniform.
- Capacity Factor: Sets a maximum limit on tokens per expert to prevent extreme imbalance.
These foundational concepts are essential. In the next parts of this series, we will study the specific innovations DeepSeek built on these principles, such as shared experts and fine-grained expert segmentation. Keep learning and building!