💡 Introduction: Why Quantization Matters
Hello, future AI engineers! Today, we focus on quantization, a key part of DeepSeek's Floating Point 8 (FP8) training approach.
In large AI models, every number (parameter) takes up space. Quantization is a way to make these numbers smaller.
The main goal of quantization is simple: we reduce the precision of a model's parameters, for example, moving from 32 bits down to 16 or 8 bits. This reduces the total memory the model needs. Accuracy may drop a little because of the lower precision, but the loss is usually small and acceptable for Large Language Models (LLMs). Understanding these techniques is key to building fast and efficient AI systems.
Let’s look at the basic data formats we use:
| Format | Bits | Range Idea |
| :--- | :--- | :--- |
| FP32 (Floating Point 32) | 32 bits | High precision, high memory. |
| FP16 (Floating Point 16) | 16 bits | Lower precision, less memory. |
| BF16 (Brain Float 16) | 16 bits | Same dynamic range as FP32 in half the bits; less memory than FP32. |
| Int8 (Integer 8) | 8 bits | Very small range (e.g., -127 to 127). |
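To see why this matters in practice, here is a tiny Python sketch that estimates the parameter memory for a hypothetical 7-billion-parameter model (the size is just an example, not a specific model) at each of the formats above:

```python
# Rough memory footprint of model parameters at different precisions.
# The 7B parameter count is a hypothetical example for illustration.
num_params = 7_000_000_000

bytes_per_param = {
    "FP32": 4,  # 32 bits
    "FP16": 2,  # 16 bits
    "BF16": 2,  # 16 bits
    "Int8": 1,  #  8 bits
}

for fmt, nbytes in bytes_per_param.items():
    gib = num_params * nbytes / (1024 ** 3)
    print(f"{fmt}: {gib:.1f} GiB")

# FP32: ~26.1 GiB, FP16/BF16: ~13.0 GiB, Int8: ~6.5 GiB
```

Halving the bits halves the memory, which is exactly the trade-off the rest of this section builds on.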
1. Mixed Precision Framework
DeepSeek uses a Mixed Precision Framework to get high speed while keeping the training stable.
"Mixed precision" means using low-precision formats (like FP8) for most calculations to save time and memory, but keeping important states in high-precision formats (like FP32) to ensure stability.
A. Forward Propagation (F-prop)
Forward propagation is the calculation $Y = W \times X$, where $X$ is the input, $W$ is the weights, and $Y$ is the output.
- Inputs (X): Inputs are converted from BF16 to Floating Point 8 (FP8). FP8 uses fewer bits and takes less memory.
- Weights (W): Weights are stored in high precision (BF16 or FP32) but are converted to FP8 on the fly (when needed) for the calculation.
- Calculation: The matrix multiplication is done using FP8 weights and FP8 inputs.
- Output (Y): The output is first computed in FP32 for better numerical stability. Then, the output is converted and stored back into BF16 to save memory. BF16 uses only 16 bits but keeps the wide numerical range of FP32.
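Here is a minimal PyTorch sketch of this forward dataflow. It assumes a recent PyTorch build that exposes the torch.float8_e4m3fn dtype; because the plain torch.matmul kernel cannot take FP8 inputs, the sketch upcasts the FP8 copies to FP32 for the multiply, which mirrors "compute in FP32, store in BF16" (real FP8 GEMM kernels do the multiplication natively in FP8):

```python
import torch

# Hypothetical shapes, chosen only for illustration.
batch, d_in, d_out = 4, 256, 128

# Inputs arrive in BF16; weights are kept in high precision (BF16/FP32).
x_bf16 = torch.randn(batch, d_in, dtype=torch.bfloat16)
w_master = torch.randn(d_out, d_in, dtype=torch.float32)

# 1. Cast inputs and weights to FP8 "on the fly" for the GEMM.
#    (Requires a PyTorch build with float8_e4m3fn support.)
x_fp8 = x_bf16.to(torch.float8_e4m3fn)
w_fp8 = w_master.to(torch.float8_e4m3fn)

# 2. Matrix multiply. This sketch upcasts to FP32 first, which keeps the
#    FP8 rounding of X and W while doing the accumulation in FP32.
y_fp32 = x_fp8.to(torch.float32) @ w_fp8.to(torch.float32).T

# 3. Store the activation back in BF16 to save memory.
y_bf16 = y_fp32.to(torch.bfloat16)
print(y_bf16.shape, y_bf16.dtype)  # torch.Size([4, 128]) torch.bfloat16
```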
B. Backward Propagation
Backward propagation calculates the gradients (how much the loss changes) to update the weights.
- Input Gradient (dGrad): The partial derivative of the loss with respect to the inputs. This GEMM uses FP8 copies of the weights and of the output gradient; the result is computed in FP32 and stored as BF16.
- Weight Gradient (WGrad): The gradient used to update the weights. It is computed and kept in FP32, because high precision here is essential for stable weight updates.
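A matching sketch of the two backward GEMMs for a linear layer (same PyTorch assumptions and the same FP32-upcast caveat as the forward pass) looks like this:

```python
import torch

batch, d_in, d_out = 4, 256, 128  # hypothetical shapes

x_bf16 = torch.randn(batch, d_in, dtype=torch.bfloat16)    # saved activation
w_master = torch.randn(d_out, d_in, dtype=torch.float32)   # high-precision weights
dy_bf16 = torch.randn(batch, d_out, dtype=torch.bfloat16)  # incoming output gradient

# dGrad: dL/dX = dY @ W. The GEMM inputs are FP8; the result is computed
# in FP32 and stored back as BF16.
dy_fp8 = dy_bf16.to(torch.float8_e4m3fn)
w_fp8 = w_master.to(torch.float8_e4m3fn)
dx_fp32 = dy_fp8.to(torch.float32) @ w_fp8.to(torch.float32)
dx_bf16 = dx_fp32.to(torch.bfloat16)

# WGrad: dL/dW = dY^T @ X. This one stays in FP32 because it feeds the
# weight update and must remain high precision.
x_fp8 = x_bf16.to(torch.float8_e4m3fn)
dw_fp32 = dy_fp8.to(torch.float32).T @ x_fp8.to(torch.float32)

print(dx_bf16.dtype, dw_fp32.dtype)  # torch.bfloat16 torch.float32
```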
C. Master Weights and Stability
DeepSeek keeps the master weights (the final, stored weights) and the optimizer states in FP32. Keeping these important variables in high precision ensures the training remains stable.
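The update pattern this describes, shown here as a simple sketch with a plain momentum update standing in for the real optimizer, looks roughly like this:

```python
import torch

# FP32 master weights and optimizer state (hypothetical sizes).
w_master = torch.randn(128, 256, dtype=torch.float32)
momentum = torch.zeros_like(w_master)  # optimizer state, kept in high precision

# WGrad arrives in FP32 (see the previous section).
dw_fp32 = torch.randn_like(w_master)

# The update itself runs entirely in high precision; tiny gradient
# contributions would be lost if this accumulation happened in FP8/BF16.
lr, beta = 1e-3, 0.9
momentum = beta * momentum + dw_fp32
w_master = w_master - lr * momentum

# Only the low-precision copy used for the next forward pass is cast down.
w_fp8 = w_master.to(torch.float8_e4m3fn)
```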
Additionally, certain modules are very sensitive and must keep their original high precision (BF16 or FP32), such as:
- Embedding modules.
- Output heads.
- Normalization operators.
- Attention operators.
By smartly using FP8 for high-volume multiplications and FP32/BF16 for sensitive parts, DeepSeek achieves major speed improvements and memory reduction compared to only using BF16.
2. Fine-Grained Quantization
The second major technique DeepSeek implemented is Fine-Grained Quantization. This technique solves a common problem in quantization related to outliers.
A. The Problem with Standard Quantization
Normally, to quantize a sequence of numbers (like a vector of data) to a lower precision (e.g., FP32 to Int8), you use one scaling factor for the entire sequence, derived from the single largest absolute value: every number is divided by that scale and rounded into the low-precision range.
The weakness of this approach is outliers. An outlier is a single number that is much larger than all others.
If you have one very large outlier, that large number forces the scaling factor to be very small for the rest of the numbers. When small numbers are scaled down and then converted to low precision (like FP8), they lose their original precision significantly, which affects accuracy when you de-quantize them later.
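A small NumPy sketch makes the problem visible. The numbers are made up for illustration: one outlier (120.0) sits next to several small but meaningful values, and a single per-vector scale erases all of them:

```python
import numpy as np

# A vector of small, meaningful values plus one large outlier.
x = np.array([0.01, 0.02, -0.03, 0.05, 120.0], dtype=np.float32)

# One scale for the whole vector, derived from the largest absolute value.
scale = np.abs(x).max() / 127.0
q = np.round(x / scale).astype(np.int8)  # quantize to int8
x_hat = q.astype(np.float32) * scale     # de-quantize

print(q)      # [0 0 0 0 127] -> every small value collapsed to 0
print(x_hat)  # [0. 0. 0. 0. 120.] -> the small values are gone
```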
B. The Fine-Grained Solution
DeepSeek’s solution is to avoid using one scaling factor for the whole block of numbers.
Fine-grained quantization works by dividing the input vector or weight matrix into smaller chunks.
- Divide into Chunks: If you have an input vector (like a token embedding), it is broken down into smaller groups, perhaps 128 values each. If you have a weight matrix (like a 256x256 matrix), it is split into smaller blocks (e.g., four 128x128 blocks).
- Separate Scaling: Crucially, each chunk gets its own scaling factor. The scaling factor for a chunk is found only by looking at the maximum value within that chunk.
This means if one chunk has a huge outlier, only that chunk's scaling is affected. The other chunks, which might contain smaller but important numbers, keep their precision because they are scaled appropriately by their own maximum value.
This method retains accuracy and precision better, especially when dealing with data that has large differences in values.
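Here is a hedged NumPy sketch of the same idea. A chunk size of 4 is used so the output is easy to read, whereas the text above mentions chunks of 128 values in practice; the data extends the earlier example so the outlier lands in its own chunk:

```python
import numpy as np

def quantize_per_chunk(x, chunk_size):
    """Symmetric int8 quantization with one scale per chunk."""
    x = x.reshape(-1, chunk_size)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0  # one scale per chunk
    q = np.round(x / scales).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

# Illustrative data: the outlier (120.0) now sits in the second chunk.
x = np.array([0.01, 0.02, -0.03, 0.05,
              120.0, 0.04, -0.02, 0.01], dtype=np.float32)

q, scales = quantize_per_chunk(x, chunk_size=4)
x_hat = dequantize(q, scales).reshape(-1)

print(x_hat)
# The first chunk keeps its small values (scaled by its own max, 0.05);
# only the chunk containing the outlier is squashed.
```

Compared with the per-vector example above, the small values in the outlier-free chunk survive quantization almost unchanged, which is exactly the benefit fine-grained scaling is after.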
We have now covered the Mixed Precision Framework and Fine-Grained Quantization, two core ideas in DeepSeek's approach to efficient training. These techniques allow the model to use less memory and run faster by managing how data is stored and calculated. In the next part, we will explore further quantization ideas, like increasing accumulation precision.