Hello Future AI Engineers!
Welcome back to our series on building DeepSeek. We are exploring quantization, which is how we make large AI models smaller and faster by using fewer bits to represent numbers.
In the previous part, we looked at mixed precision and fine-grain quantization. In this part, we look at three more technical methods DeepSeek uses to keep the model fast without losing accuracy:
- Increasing Accumulation Precision
- Mantissa Over Exponents
- Online Quantization
Understanding these techniques is very important for your AI and systems engineering studies. They show how smart design choices solve complex numerical problems when training huge models.
1. Increasing Accumulation Precision
The Problem: Low Precision Errors
When we train big AI models, the computer performs many General Matrix Multiplication (GEMM) operations, like calculating Y = W * X + B. To save memory and speed up computation, we often use low precision numbers, such as Floating Point 8 (FP8).
These calculations often happen on special parts of the GPU called Tensor Cores. A problem arises because Tensor Cores accumulate (sum up) intermediate results with a limited precision, sometimes only around 14 bits. This 14-bit precision is much lower than the standard Floating Point 32 (FP32) precision.
If we perform many multiplications with low accumulation precision, especially with large matrices (e.g., inner dimension K=4096), we quickly lose accuracy. This can lead to a significant numerical error, sometimes as large as 2%. This loss of accuracy is called limited accumulation precision.
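To see this effect concretely, here is a minimal NumPy sketch (ours for illustration, not DeepSeek's code). NumPy has no 14-bit accumulator, so float16 stands in for a limited-precision running sum, and a regular FP32 dot product serves as the reference:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4096  # inner dimension, as in the example above

a = rng.random(K, dtype=np.float32)
b = rng.random(K, dtype=np.float32)

ref = np.dot(a, b)  # FP32 accumulation as the reference

# Limited-precision accumulation: rounding error compounds at every step,
# and small products get swallowed once the running sum grows large.
acc = np.float16(0.0)
for p in (a * b).astype(np.float16):
    acc = np.float16(acc + p)

print(f"FP32 reference    : {ref:.2f}")
print(f"low-precision sum : {float(acc):.2f}")
print(f"relative error    : {abs(float(acc) - ref) / ref:.2%}")
```

The exact error depends on the data, but with thousands of additions the low-precision sum drifts visibly away from the FP32 result.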
The DeepSeek Solution: Promotion to CUDA Cores
DeepSeek increases the accumulation precision to solve this problem. Their approach uses both the fast Tensor Core and the high-precision CUDA Core, in two steps.
Step 1: Low-Precision Accumulation (Tensor Core). The matrix multiplication starts on the low-precision Tensor Core, using FP8 inputs. Intermediate results are accumulated (summed up) inside the Tensor Core at its limited precision (around 14 bits). This initial step is fast.
Step 2: Promotion to High Precision (CUDA Core). To prevent errors from building up, DeepSeek periodically transfers these partial, low-precision accumulated results to the CUDA Core. This transfer happens at a fixed interval, for example after every 128 elements.
The CUDA Core is a high-precision unit. When the data is moved here, the results are stored in full Floating Point 32 (FP32) precision. This process is called promotion to CUDA cores.
By periodically moving the partial sums to the high-precision memory of the CUDA Core, DeepSeek keeps the results accurate and prevents numerical errors from building up.
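Here is a sketch of the two-step scheme using the same stand-ins as before: float16 plays the Tensor Core accumulator, an FP32 variable plays the CUDA Core registers, and the function name `chunked_dot` is ours for illustration:

```python
import numpy as np

def chunked_dot(a, b, chunk=128):
    total = np.float32(0.0)  # high-precision "CUDA Core" accumulator
    for s in range(0, len(a), chunk):
        partial = np.float16(0.0)  # low-precision "Tensor Core" accumulator
        for p in (a[s:s + chunk] * b[s:s + chunk]).astype(np.float16):
            partial = np.float16(partial + p)
        # Promotion: move the partial sum to FP32 before error can build up.
        total += np.float32(partial)
    return total

rng = np.random.default_rng(0)
a = rng.random(4096, dtype=np.float32)
b = rng.random(4096, dtype=np.float32)
print(chunked_dot(a, b), np.dot(a, b))  # the two results now agree closely
```

Each low-precision partial sum only ever covers 128 elements, so it never grows large enough for serious rounding loss, and the FP32 total absorbs the rest.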
2. Mantissa Over Exponents
Understanding Floating Point Numbers
Any floating point number (like FP8) is made up of three parts: a sign, an exponent, and a mantissa.
- The exponent controls the dynamic range: the span of values we can represent, from very small to very large.
- The mantissa controls the precision: how finely values are spaced, that is, the distance between neighboring representable values.
For example, two common FP8 formats are E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits); the short sketch after this list makes the trade-off concrete.
- E4M3: smaller range, but higher precision.
- E5M2: larger range, but lower precision.
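The limits below are the published ones for the OCP FP8 formats (E4M3 tops out at 448 rather than 480 because its all-ones code is reserved for NaN):

```python
# (mantissa bits, largest finite value) for each FP8 format
formats = {"E4M3": (3, 448.0), "E5M2": (2, 57344.0)}

for name, (man_bits, max_val) in formats.items():
    step = 2.0 ** -man_bits  # gap between neighboring values in [1, 2)
    print(f"{name}: values up to ±{max_val:.0f}, step near 1.0 = {step}")
```

E5M2 reaches values over a hundred times larger, but between 1 and 2 it can only represent every 0.25, while E4M3 can represent every 0.125.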
DeepSeek's Choice
Traditionally, AI models used E4M3 for the forward pass (to keep high precision) and E5M2 for the backward pass (to keep a large range for gradients).
DeepSeek decided to use E4M3 uniformly for both the forward pass and the backward pass. This means they always prioritize higher precision (more mantissa bits) over extra range.
Why E4M3 works everywhere: This choice works because DeepSeek uses fine-grain quantization. Fine-grain quantization groups numbers together and applies a separate scaling factor to each small group.
Because each small group is scaled by its own maximum value, a single outlier cannot force the rest of the tensor toward extremely small numbers. The values inside each group therefore span a narrow range, so the smaller dynamic range of E4M3 is enough. This lets DeepSeek use E4M3 throughout and keep the benefit of higher precision.
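The sketch below illustrates this with a hand-rolled approximation of the E4M3 cast (it models the format's coarse spacing and underflow floor but ignores NaN encodings; the helper names are ours). One outlier ruins per-tensor scaling, while per-group scaling with groups of 128 keeps the small values usable:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def to_e4m3(x):
    # Simplified round-to-nearest cast into E4M3: 3 mantissa bits,
    # with a 2**-9 spacing floor to mimic the subnormal range.
    mag = np.abs(x)
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    ulp = 2.0 ** np.maximum(e - 3, -9.0)
    return np.clip(np.sign(x) * np.round(mag / ulp) * ulp, -E4M3_MAX, E4M3_MAX)

def quantize(x, group_size):
    # Scale each group by its own absmax, cast to E4M3, then scale back.
    g = x.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / E4M3_MAX
    return (to_e4m3(g / scales) * scales).ravel()

rng = np.random.default_rng(0)
x = (rng.random(1024) * 0.01).astype(np.float32)  # small activations
x[0] = 100.0                                      # a single large outlier

for gs, label in [(1024, "per-tensor"), (128, "fine-grain")]:
    rel = np.abs(quantize(x, gs) - x) / (np.abs(x) + 1e-8)
    print(f"{label:10s} mean relative error = {rel.mean():.2%}")
```

With one scale for the whole tensor, the outlier drags the small values down toward E4M3's underflow floor; with one scale per group of 128, only the outlier's own group pays that price.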
3. Online Quantization
The Problem: Delayed Scale Factors
Quantization requires a scale factor to correctly adjust the number values.
In a method called delayed quantization, the system calculates the scale factor based on the numbers from a previous computation (past iterations). It uses historical data, not the present data.
If the current group of numbers (the current tensor) has a very different range (for instance, a much larger value suddenly appears), the scale factor from the past will be wrong. Using a stale scale factor can cause the current numbers to overflow (become too large to represent) or underflow (become too small to distinguish from zero). This introduces errors.
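A tiny sketch of the failure mode, with made-up numbers: suppose the previous iteration's largest magnitude was 3.0, and the current tensor suddenly contains a 12.0:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

prev_absmax = 3.0                    # hypothetical absmax from a past iteration
stale_scale = prev_absmax / E4M3_MAX

current = np.array([0.5, -2.0, 12.0])  # 12.0 is a new outlier
codes = current / stale_scale
clipped = np.clip(codes, -E4M3_MAX, E4M3_MAX)  # 12.0 maps far past the FP8 max
print(clipped * stale_scale)  # comes back as [0.5, -2.0, 3.0]: the outlier is lost
```

The stale scale maps the outlier to roughly 1792, far beyond E4M3's maximum of 448, so it is clipped and dequantizes back to 3.0 instead of 12.0.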
The DeepSeek Solution: Real-Time Scaling
DeepSeek uses online quantization to solve this issue.
In online quantization, the scale factor is calculated in real time ("on the fly") based only on the current tensor’s data. The system computes the maximum absolute value of the current data right away.
By using a scale factor computed from the present data, online quantization makes sure that all numbers stay within the representable FP8 range. This avoids the overflow and underflow errors common in delayed quantization, making training more stable and accurate.
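Here is the fix, continuing the same toy example (only the scaling step is shown; mantissa rounding is ignored for clarity, and the function name is ours):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def online_quantize(x):
    # The scale comes from the *current* tensor's absmax, computed on the fly,
    # so every value lands inside the representable FP8 range by construction.
    scale = np.abs(x).max() / E4M3_MAX
    return x / scale, scale

current = np.array([0.5, -2.0, 12.0])  # the tensor that broke the stale scale
codes, scale = online_quantize(current)
print(codes * scale)  # round-trips to [0.5, -2.0, 12.0]: nothing is clipped
```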
These three techniques (increasing accumulation precision, uniform E4M3 formats, and online quantization) work together with the mixed-precision and fine-grain quantization framework from the previous part to give DeepSeek strong performance gains and better memory usage.
Good job! You have now mastered the key quantization techniques that make modern LLMs highly efficient.