Build DeepSeek from scratch - Part 20: The Foundation of Quantization

December 17, 2025

Introduction: Why Quantization Matters

Hello, future AI engineers! We have studied three important parts of the DeepSeek model: latent attention, mixture of experts, and multi-token prediction.

Now we look at the final major component: quantization. Quantization is a core part of DeepSeek's training infrastructure.

Why is this important for you? Large language models (LLMs) need huge amounts of computer memory. By understanding quantization, you learn how to reduce memory usage significantly while keeping good performance, which is a core systems-engineering skill.

In this part, we will learn what quantization is and why we need it.

1. AI Parameters and Memory

An AI model like DeepSeek is made up of billions of learned numbers called parameters (the weights and biases of the network).

Parameters appear in many places inside the model: in the token embeddings, in the multi-head attention blocks, and in the feed-forward networks. When you build an LLM, these parameters take up memory, much like a house takes up land.

In an LLM, these parameters are used in computation constantly. For example, in the multi-head attention block, inputs are multiplied by weight matrices in large matrix multiplications, as the toy sketch below shows.
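
To make this concrete, here is a minimal sketch of one such multiplication in NumPy. The shapes are made up for illustration (4 tokens, hidden size 8), not DeepSeek's real dimensions:

```python
import numpy as np

# Toy version of one projection inside multi-head attention:
# input activations are multiplied by a learned weight matrix.
x = np.random.randn(4, 8).astype(np.float32)  # 4 tokens, hidden size 8
W = np.random.randn(8, 8).astype(np.float32)  # weight matrix: 8 x 8 = 64 parameters

out = x @ W               # one of the many matrix multiplications in an LLM
print(out.shape)          # (4, 8)
print(W.nbytes, "bytes")  # 64 parameters x 4 bytes (FP32) = 256 bytes
```

Even this tiny weight matrix occupies memory; scale the same idea to billions of parameters and memory becomes the bottleneck.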

2. How Numbers Use Memory (FP32)

The amount of memory a parameter uses depends on how the number is represented.

By default, most parameters are represented using Floating Point 32 (FP32).

FP32 uses 32 bits, or 4 bytes, per number. For a huge model with 70 billion parameters, that comes to 70 billion × 4 bytes = 280 GB of memory.

3. What is Quantization?

Quantization is the process of reducing the precision of a model's parameters.

It means we take a parameter that uses many bits (like 32) and represent it using fewer bits (like 16 or 8).

The Benefits of Quantization

We reduce the number of bits to save memory.

If we represent the same 70 billion parameters using only 16 bits (FP16), the memory requirement drops from 280 GB to just 140 GB. This is a big saving!
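
To make the arithmetic concrete, here is a minimal sketch that computes these numbers, treating 1 GB as 10^9 bytes:

```python
# Memory needed to store 70 billion parameters at different precisions.
num_params = 70e9

for name, bits in [("FP32", 32), ("FP16", 16), ("FP8/Int8", 8)]:
    gigabytes = num_params * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{name:8s}: {gigabytes:.0f} GB")

# FP32    : 280 GB
# FP16    : 140 GB
# FP8/Int8: 70 GB
```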

The Challenge of Precision

When you use fewer bits, you get lower precision. This is the cost of saving memory.

Think of it like an image: the original image might have many colors and be very sharp. A quantized image uses fewer colors, so it is less sharp or more "pixelated" if you zoom in.

However, for many operations in an LLM, losing a small amount of precision is acceptable: overall accuracy drops only slightly. That trade-off is the whole point of quantization: a large memory saving in exchange for a small performance loss.

4. Different Representations

When we quantize, we change the bit representation. DeepSeek works with several of these representations:

| Name | Bits Used | Description | Range and Precision |
| :--- | :--- | :--- | :--- |
| FP32 | 32 bits | Default floating point representation. | Highest precision and widest range. |
| FP16 | 16 bits | Floating Point 16. | Reduced memory, but also reduced range and precision. |
| BF16 | 16 bits | Brain Float 16. | Uses 16 bits like FP16, but is designed to keep the wide range of FP32. |
| Int8 | 8 bits | Integer 8. | Lowest number of bits. The values are integers (no decimals) with a very small range (e.g., -127 to 127). |
| FP8 | 8 bits | Floating Point 8. | Also 8 bits like Int8, but a floating-point format, so it can represent decimal values. |
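
If you have PyTorch installed, you can inspect these ranges yourself. A quick sketch (FP8 dtypes such as torch.float8_e4m3fn exist only in recent PyTorch releases, so they are left out here):

```python
import torch

# Range and precision of the floating-point representations.
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  eps={info.eps:.3e}")

# Int8 is an integer type: no decimals, and a very small range.
info = torch.iinfo(torch.int8)
print(f"{str(torch.int8):15s} range: {info.min} to {info.max}")  # -128 to 127
```

Note that torch.iinfo reports the full Int8 range of -128 to 127; symmetric quantization schemes typically use only -127 to 127, which is the range quoted in the table.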

5. The Core Math of Quantization

How do we convert a large FP32 number to a smaller representation like Int8?

We use a simple scaling technique to fit the number into the new, smaller range (like -127 to 127).

  1. Find the Maximum: Look at all your FP32 numbers and find the one with the largest absolute value.
  2. Scale Down: Divide every FP32 number by this maximum. All values now lie between -1 and 1.
  3. Scale Up: Multiply the result by the maximum value of the target bit type (e.g., multiply by 127 for Int8).
  4. Round: Take the closest integer.

This procedure maps the wide range of FP32 values onto the much smaller range of the target type. The step of dividing by the maximum value is the foundation for techniques like "fine-grained quantization".
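
Here is a minimal sketch of this scheme, often called absmax quantization, in NumPy. The function names are my own, not DeepSeek's:

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Map FP32 values to Int8 using the four steps above."""
    scale = np.max(np.abs(x))                      # 1. largest absolute value
    q = np.round(x / scale * 127).astype(np.int8)  # 2-4. scale down, scale up, round
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the original FP32 values."""
    return q.astype(np.float32) * scale / 127

x = np.array([0.5, -1.2, 3.4, -0.01], dtype=np.float32)
q, scale = absmax_quantize(x)
print(q)                     # [ 19 -45 127   0]
print(dequantize(q, scale))  # close to x, but not exact
```

Notice the last value: -0.01 rounds all the way to 0, so its information is gone after dequantization. That rounding error is exactly the precision cost discussed above.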

Next Steps: DeepSeek's Innovations

Now that we understand the basics of what quantization is and why we do it, we can look at the advanced techniques DeepSeek uses.

The DeepSeek technical report describes five main innovations in its quantization routine:

  1. Mixed Precision Framework
  2. Fine-Grained Quantization
  3. Increasing Accumulation Precision
  4. Mantissa over Exponents
  5. Online Quantization

These five ideas are what make DeepSeek's infrastructure special. In the next part, we will break down the first three of these techniques.