1. Introduction: Why Mixture of Experts Matters
Hello, future AI engineers! We are learning how to build powerful language models like DeepSeek. DeepSeek introduces two key innovations: Multi-head Latent Attention (MLA) and the Mixture of Experts (MoE) architecture.
Understanding MoE is very important for two reasons:
- It is one of the most modern and innovative techniques used in large language models (LLMs) today.
- It solves a big problem: making huge models faster to train and faster to use.
In this part, we will look at the main idea of MoE, which helps us use far less computing power during pre-training compared to traditional models.
2. The Traditional Transformer Block and the FFN Problem
In a large language model, the key building block is the Transformer Block. Inside this block, all the main work happens. The block contains Multi-Head Attention and a component called the Feed Forward Neural Network (FFN).
The Role of the Feed Forward Network
The FFN is critical. When an input passes through the FFN, it first expands the input dimension (typically to four times its size) and then contracts it back to the original size. This expansion and contraction allows the language model to explore a much richer space of possibilities.
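To make this concrete, here is a minimal sketch of a standard Transformer FFN in PyTorch. The dimensions (768, 4x expansion) follow common GPT-style settings and are illustrative assumptions, not DeepSeek's exact configuration:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """A standard Transformer FFN: expand, apply a non-linearity, contract."""
    def __init__(self, d_model=768, expansion=4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # 768 -> 3072
        self.act = nn.GELU()
        self.down = nn.Linear(expansion * d_model, d_model)  # 3072 -> 768

    def forward(self, x):
        return self.down(self.act(self.up(x)))

x = torch.randn(1, 10, 768)    # (batch, tokens, d_model)
print(FeedForward()(x).shape)  # torch.Size([1, 10, 768])
```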
The Cost of the FFN
The problem with the FFN is the number of parameters (weights) it uses.
- If the input dimension is 768, the expansion layer alone (768 → 3072) holds about 2.4 million weights.
- For one single Transformer block, the two FFN matrices together hold roughly 4.7 million parameters.
- If a model has many blocks (like 12 blocks), the FFNs alone account for roughly 57 million parameters, as the quick check below shows.
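We can verify these numbers directly (bias terms are ignored for simplicity):

```python
d_model = 768
d_ff = 4 * d_model                           # 3072

per_block = d_model * d_ff + d_ff * d_model  # expansion + contraction weights
print(f"{per_block:,}")       # 4,718,592  (~4.7M per block)
print(f"{12 * per_block:,}")  # 56,623,104 (~57M across 12 blocks)
```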
In a dense model, every single input token must pass through all of these parameters. This heavy computation:
- Increases the time needed for training (pre-training).
- Increases the time needed for generating output (inference).
3. The MoE Solution: Experts and Sparsity
Mixture of Experts (MoE) is an innovation that changes how the FFN works. Although the main idea of MoE is not new (it first appeared in 1991), DeepSeek built upon it using creative new techniques.
Replacing One FFN with Multiple Experts
In MoE, instead of having just one FFN in the Transformer block, we use multiple smaller neural networks. These smaller networks are called experts.
For example, a Transformer block might now have four experts instead of one large FFN. Having more networks seems like it would require more computation, but the opposite is true.
The Core Idea: Sparsity
The secret to why MoE is faster is sparsity.
In a dense, traditional model:
- Every input token activates 100% of the FFN parameters.
In a sparse MoE model:
- Only a small subset of experts is activated for any input token.
- If a model has 64 experts, often only 2 are activated for each token.
Because we activate only a few specialized experts, the total computation needed is much lower than activating one huge dense FFN. This trick reduces pre-training time and makes inference faster.
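The sketch below shows these mechanics with a toy top-2-of-4 router. The names and sizes here (MoELayer, make_expert, num_experts, top_k) are my own illustrative assumptions, not DeepSeek's real configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_expert(d_model):
    """Each expert is a small FFN of its own."""
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
    )

class MoELayer(nn.Module):
    """Sparse MoE: a gate picks the top_k experts per token; the rest stay idle."""
    def __init__(self, d_model=768, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([make_expert(d_model) for _ in range(num_experts)])
        self.gate = nn.Linear(d_model, num_experts)  # produces routing scores
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)           # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top_k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                      # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out, idx                                    # idx lets us inspect the routing
```

Only the selected experts ever run a forward pass for a given token, which is exactly where the compute savings come from.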
4. How Experts Specialize
The reason we can use sparsity is that these experts are specialized. They are trained to handle specific kinds of information.
The complex task of language modeling is split into smaller subtasks, and each expert network solves a subtask.
For example, in deep language models, different experts learn to handle different parts of language:
- One expert might specialize in punctuation (like commas and full stops).
- Another expert might specialize in articles and conjunctions (like 'the' or 'if').
- Other experts learn to deal with numbers or proper names.
When a token (for example, the number '7') enters the block, the system uses a gating network to decide which expert is best. Since the number expert is specialized for this job, only that small expert is activated, saving massive computation time compared to activating the entire model.
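Continuing with the MoELayer sketch from above, we can watch the router's choices. Keep in mind that with untrained random weights the choices are arbitrary; the clean specialization described here (a "number expert", a "punctuation expert") only emerges during training:

```python
moe = MoELayer(d_model=768, num_experts=4, top_k=2)
tokens = torch.randn(3, 768)  # stand-in embeddings for 3 tokens, e.g. "7", ",", "the"
out, idx = moe(tokens)
print(out.shape)  # torch.Size([3, 768])
print(idx)        # e.g. tensor([[2, 0], [1, 3], [0, 2]]): each token's 2 chosen experts
```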
5. MoE Across the Entire DeepSeek Architecture
It is very important to understand that the MoE structure is not just in one Transformer block. Mixture of Experts is used in all Transformer blocks throughout the model.
When an input token travels through the model (from layer 1 to layer 12, for example):
- In the first block, the token might be routed to Expert 1.
- In the second block, the same token might be routed to a different expert (Expert 3).
- In the third block, it might go to Expert 2.
The specialization and routing can be different in every layer, based on what computation is needed at that depth.
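A minimal sketch of this stacking, reusing the MoELayer from earlier (attention and normalization are omitted to focus on routing; each layer owns its own gate and experts):

```python
layers = nn.ModuleList([MoELayer(d_model=768, num_experts=4, top_k=2) for _ in range(3)])

x = torch.randn(1, 768)  # one token's embedding
for depth, layer in enumerate(layers, start=1):
    x, idx = layer(x)
    print(f"block {depth}: routed to experts {idx[0].tolist()}")
# Each block's gate is trained independently, so the same token
# can land on different experts at different depths.
```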
6. DeepSeek's MoE Innovations
DeepSeek did not just borrow the old MoE idea; they improved it. The DeepSeek architecture included new ideas like:
- Fine-Grained Expert Segmentation
- Shared Expert Isolation
- Auxiliary-loss-free load balancing (used in later versions)
These specific techniques helped DeepSeek efficiently manage its experts and optimize performance, making it a powerful modern architecture.
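As a rough illustration of the shared-expert idea only (a concept sketch under my own assumptions, not DeepSeek's actual implementation): a few shared experts process every token unconditionally, while the remaining experts are routed sparsely, reusing make_expert and MoELayer from earlier.

```python
class SharedExpertMoE(nn.Module):
    """Concept sketch: shared experts always run; routed experts stay sparse."""
    def __init__(self, d_model=768, num_routed=8, num_shared=1, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList([make_expert(d_model) for _ in range(num_shared)])
        self.routed = MoELayer(d_model, num_experts=num_routed, top_k=top_k)

    def forward(self, x):
        shared_out = sum(expert(x) for expert in self.shared)  # always active
        routed_out, _ = self.routed(x)                         # sparse: top_k experts only
        return shared_out + routed_out
```

The intuition is that shared experts capture common knowledge every token needs, freeing the routed experts to specialize more finely.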
Action Step for Students:
The concept of sparsity is the foundation of MoE. We have seen the intuition; next, we will look at the mathematics of MoE to understand how the routing and selection process actually works in code. Prepare your notes for the next lesson!