Build DeepSeek from scratch - Part 15: DeepSeek's MoE Innovations

December 17, 2025

Introduction: Why We Need a New Architecture

Welcome, future AI engineers! This series will teach you how complex models like DeepSeek work. We start with the core idea: the Mixture of Experts (MoE) model.

In a standard Transformer model, each block contains a key component called the Feed Forward Neural Network (FFN). The FFN processes every token's representation. But as models grow very large, the FFN holds most of the parameters and becomes slow and expensive to compute.

The goal of this part is to understand how MoE solves this problem. We replace the single FFN with many specialized neural networks, or "experts", and activate only a few of them for each token. This makes training (pre-training) and using the model (inference) much cheaper per token. Learning this is vital because it is a foundation for building efficient, modern AI systems.

1. What is a Mixture of Experts (MoE)?

Imagine you have one single computer program doing all jobs. MoE says: let's use several smaller, specialized programs instead.

In the Transformer model, MoE replaces the single Feed Forward Neural Network with multiple neural networks. We call these multiple FFNs 'experts'.

For example, if we have three experts (E1, E2, E3), each expert is its own neural network. An expert preserves the input dimension: if your input token has a dimension of 8, the expert's output will also have a dimension of 8.

The main challenge after getting outputs from all experts is that we receive many output matrices (e.g., three 4x8 matrices if we have 4 tokens and three experts). We only need one final output matrix of the same size (4x8). We must find a way to combine the outputs efficiently.
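
To make this concrete, here is a minimal sketch of one expert in PyTorch. The class name Expert, the hidden size of 32, and the choice of ReLU are illustrative choices for this tutorial, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """One expert: a small FFN that keeps the model dimension unchanged."""
    def __init__(self, d_model: int = 8, d_hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),  # project back to d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

tokens = torch.randn(4, 8)                             # 4 tokens, dimension 8
experts = nn.ModuleList([Expert() for _ in range(3)])  # E1, E2, E3

outputs = [e(tokens) for e in experts]  # three 4x8 matrices
print([o.shape for o in outputs])       # [torch.Size([4, 8])] three times
```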

2. The Power of Sparsity (Efficiency)

Why use many experts if we only need one final output? The secret is sparsity.

Sparsity means that when a piece of data (a token) enters the MoE block, not all experts are activated. Only certain, specialized experts are activated for that token.

The number of experts activated per token is a design choice known as "top-k" routing. If you have 64 experts but activate only two for every token, each token touches just 2/64 ≈ 3% of the expert parameters: very high sparsity. (A related but distinct problem, called load balancing, is making sure tokens are spread evenly across the experts instead of crowding onto a few popular ones; DeepSeek's innovations there come later in this series.)

3. The Routing Mechanism: Selecting and Weighing Experts

Sparsity tells us how many experts to use (e.g., top $K=2$). But we still need to answer two key questions:

  1. Which two experts should we select for a given token?
  2. How much weight (importance) should we give to each selected expert?

The routing mechanism answers these questions using the Routing Matrix.

Step 3.1: Creating the Expert Selector Matrix

To start the routing process, we use a special trainable matrix called the Routing Matrix.

  1. Multiply Input by Routing Matrix: We take the Input Matrix (e.g., 4 tokens x 8 dimensions) and multiply it by the Routing Matrix, whose shape is (input dimension x number of experts), here 8 x 3 (see the sketch after this list).
    • The result is the Expert Selector Matrix (4 x 3).
    • The rows of this new matrix match the tokens (4 rows).
    • The columns match the number of experts (3 columns for 3 experts), so each entry scores how well one expert fits one token.
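
A minimal sketch of this multiplication in PyTorch; the names X, W_router, and selector are my own, and the random values simply stand in for learned weights:

```python
import torch

num_tokens, d_model, num_experts = 4, 8, 3

X = torch.randn(num_tokens, d_model)   # Input Matrix (4 x 8)
W_router = torch.randn(d_model, num_experts, requires_grad=True)  # trainable Routing Matrix (8 x 3)

selector = X @ W_router                # Expert Selector Matrix (4 x 3)
print(selector.shape)                  # torch.Size([4, 3]): one score per (token, expert) pair
```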

Step 3.2: Selecting the Top K Experts

The Expert Selector Matrix helps us choose the right experts.

  1. Look at Each Token (Row): For the first token, we look at the values in the first row of the Expert Selector Matrix.
  2. Choose Highest Values: We select the K highest values in that row. If we set $K=2$, we select the two experts that have the highest corresponding values.
    • Example: If the highest values in Row 1 are for Expert 2 and Expert 3, then Token 1 is routed to E2 and E3 (the snippet after this list shows this selection in code).
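
Assuming a made-up Expert Selector Matrix, the selection step is a single call to torch.topk:

```python
import torch

# Hypothetical Expert Selector Matrix for 4 tokens and 3 experts
selector = torch.tensor([[0.1, 2.0, 1.5],   # token 1: E2 and E3 score highest
                         [1.8, 0.2, 1.1],
                         [0.5, 1.9, 0.3],
                         [2.2, 0.1, 1.7]])
k = 2
topk_vals, topk_idx = torch.topk(selector, k, dim=-1)  # per-row top-k
print(topk_idx[0])  # tensor([1, 2]) -> Expert 2 and Expert 3 (0-indexed)
```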

Step 3.3: Assigning Weightage using Softmax

Now we know which experts to use. The next task is finding how much importance (weightage) to give to them. We want the selected expert weights for each token to add up to 1.

We use the Softmax operation to do this:

  1. Sparsity Enforcement: Before applying Softmax, we replace the values of the experts that were not selected (the inactive experts) with negative infinity.
  2. Softmax Application: Applying Softmax ensures two things:
    • Any value that was negative infinity becomes zero (enforcing sparsity).
    • The values of the selected experts are normalized, so they now sum up to 1 (e.g., 0.6 + 0.4 = 1).
  3. Result: This final matrix is the Expert Selector Weight Matrix. It tells us exactly:
    • Which experts are selected (weights > 0).
    • How much weight each selected expert receives (e.g., E2 gets 60%, E3 gets 40%). The snippet after this list walks through this masking-and-softmax step.
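
A sketch of the step for a single token, with made-up scores chosen so the weights come out near the 60/40 split used above:

```python
import torch
import torch.nn.functional as F

selector = torch.tensor([[0.1, 2.0, 1.5]])         # token 1's scores over 3 experts
topk_vals, topk_idx = torch.topk(selector, k=2, dim=-1)

masked = torch.full_like(selector, float('-inf'))  # start with every expert "switched off"
masked.scatter_(-1, topk_idx, topk_vals)           # restore only the top-k scores

weights = F.softmax(masked, dim=-1)                # Expert Selector Weight Matrix (one row here)
print(weights)  # tensor([[0.0000, 0.6225, 0.3775]]): unselected -> 0, selected sum to 1
```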

4. Merging the Outputs

After all that work, we finally know how to combine the three expert output matrices into one 4x8 matrix.

This merging is done token by token using the weights we just calculated:

  1. Look at Token 1: We look at the first row of the Expert Selector Weight Matrix. Let's say it gives E2 a weight of 0.6 and E3 a weight of 0.4.
  2. Calculate Weighted Output:
    • We take the first row of Expert Output 2 and multiply it by 0.6.
    • We take the first row of Expert Output 3 and multiply it by 0.4.
  3. Sum the Results: We add these two weighted rows together. This gives us the final 1x8 vector (row) for the first token.

We repeat this process for every token (token 2, token 3, token 4).

By combining the resultant vectors for all tokens, we get the final 4x8 output matrix. It has the correct dimensions, just like the original FFN output, but because each token only runs through its selected experts, the computation is much cheaper than running one giant FFN.
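
To tie the whole pipeline together, here is a minimal end-to-end sketch in PyTorch. For clarity it runs every expert and zeroes out the unselected ones via the weights; a real MoE implementation would dispatch each token only to its selected experts. All names and sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, d_model, num_experts, k = 4, 8, 3, 2

# Experts: small FFNs that preserve d_model (Section 1)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
    for _ in range(num_experts)
])
W_router = nn.Parameter(torch.randn(d_model, num_experts))  # trainable Routing Matrix

X = torch.randn(num_tokens, d_model)             # input (4 x 8)

# Routing (Section 3)
selector = X @ W_router                           # Expert Selector Matrix (4 x 3)
topk_vals, topk_idx = torch.topk(selector, k, dim=-1)
masked = torch.full_like(selector, float('-inf'))
masked.scatter_(-1, topk_idx, topk_vals)
weights = F.softmax(masked, dim=-1)               # Expert Selector Weight Matrix (4 x 3)

# Merging (Section 4): weighted sum of expert outputs, token by token
expert_outs = torch.stack([e(X) for e in experts], dim=-1)  # (4 x 8 x 3)
output = (expert_outs * weights.unsqueeze(1)).sum(dim=-1)   # (4 x 8)
print(output.shape)  # torch.Size([4, 8]): same shape as a single FFN's output
```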

This mechanism—using sparsity to be efficient and routing to assign weights—is the main trick of Mixture of Experts. In later parts, we will see how DeepSeek improves upon this basic MoE foundation.