Build DeepSeek from scratch - Part 18: Multi-Token Prediction Foundation

December 17, 2025

Understanding Multi-Token Prediction (MTP)

Introduction: Why This Part Matters

Hello, future AI engineers! In this series, we build a powerful system like DeepSeek. DeepSeek V3 uses three big ideas to work well: Multi-Head Latent Attention, Mixture of Experts, and Multi-Token Prediction (MTP).

If you are learning AI and systems engineering, understanding MTP is key. MTP changes how the model learns during training. It helps the model become smarter at forecasting, which is very important for modern Large Language Models (LLMs).

This part explains the foundation of MTP: what it is and why it is so useful for building highly efficient AI models.

1. Single Token vs. Multi-Token Prediction

Think about how standard language models learn. They usually use Single Token Prediction.

When you give the model an input sequence, it predicts exactly one next token at each position.

Example: If the input is "Artificial", the model predicts "intelligence".

This standard method means the model only learns the next immediate step.
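As a toy sketch of this (assuming, purely for readability, that each word is one token), single-token training pairs each position with exactly one target: the very next token.

```python
def single_token_targets(tokens):
    # At position i the model sees everything up to tokens[i]
    # and is trained to predict exactly one thing: tokens[i + 1].
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

tokens = ["Artificial", "intelligence", "is", "changing", "everything"]
pairs = single_token_targets(tokens)
# pairs[0] == ("Artificial", "intelligence") -- one target per input position.
```

Every input position contributes a single learning signal, which is the baseline MTP improves on.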

How Multi-Token Prediction is Different

DeepSeek uses Multi-Token Prediction (MTP). MTP means that for every input token, the model predicts multiple future tokens at the same time.

Let's say we want to predict three future tokens.

Example: If the input is "Artificial", the model predicts three tokens: "intelligence," "is," and "changing" simultaneously.

The training process then calculates the error (loss) between the three predicted tokens and the three actual tokens. We are looking at a longer future horizon for every input token.
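Continuing the same whole-word-token sketch (the function and setup are illustrative, not DeepSeek's actual code), the targets for a prediction horizon of three look like this:

```python
def mtp_targets(tokens, depth=3):
    # At position i the model is trained to predict the next `depth`
    # tokens at once, so every input position yields `depth` targets.
    return [(tokens[i], tokens[i + 1 : i + 1 + depth])
            for i in range(len(tokens) - depth)]

tokens = ["Artificial", "intelligence", "is", "changing", "everything", "fast"]
targets = mtp_targets(tokens, depth=3)
# targets[0] == ("Artificial", ["intelligence", "is", "changing"])
```

The loss then compares all three predictions per position against these three ground-truth tokens, not just the first one.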

2. Why MTP is Powerful: Four Key Benefits

MTP is more than a cosmetic change; it meaningfully improves training. There are four main reasons why DeepSeek and other modern LLMs use it:

A. Richer Training Signals

Training is how the model learns from data. MTP provides richer and denser training signals than the standard method.

  1. More Information Per Step: Single token prediction only gives the model one piece of information (one gradient) per input. MTP gives multiple pieces of information (multiple gradients) for the same input, making the training sample more informative.
  2. Learning Long-Range Structure: Because the model predicts multiple steps ahead, it learns about the structure, grammar, and coherence of the text over longer ranges. This guides the model to be better at planning and forecasting sequences.
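To make "multiple gradients per input" concrete, here is a minimal numerical sketch (toy probabilities and a plain averaged cross-entropy, not DeepSeek's actual loss code): the MTP loss is one cross-entropy term per future offset, so each input position feeds several error signals back into the model.

```python
import math

def cross_entropy(probs, target_index):
    # Standard cross-entropy for one prediction: -log p(correct token).
    return -math.log(probs[target_index])

def mtp_loss(per_depth_probs, per_depth_targets):
    # One cross-entropy term per future offset (depth). Averaging them
    # yields several gradient signals per input position instead of one.
    losses = [cross_entropy(p, t)
              for p, t in zip(per_depth_probs, per_depth_targets)]
    return sum(losses) / len(losses)

# Toy 3-token vocabulary; the model outputs one distribution per depth.
per_depth_probs = [
    [0.7, 0.2, 0.1],  # distribution over the token 1 step ahead
    [0.1, 0.8, 0.1],  # 2 steps ahead
    [0.2, 0.2, 0.6],  # 3 steps ahead
]
per_depth_targets = [0, 1, 2]  # true token ids at each future offset
loss = mtp_loss(per_depth_probs, per_depth_targets)
```

With single-token prediction only the first term would exist; here, three terms per position supervise the model at once.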

B. Improved Data Efficiency

MTP helps the model learn more using the same amount of data.

Research has shown that models trained with MTP achieve better results on standard AI tests, such as HumanEval and MBPP, even with the same amount of training data. For instance, they solved about 15% more code problems on average.

This benefit is especially noticeable when the AI model size is large.

C. Better Planning and Decision Making

MTP helps the model become better at making important decisions.

  1. Choice Points: Some tokens are "choice points"—key tokens that greatly influence how the rest of the sentence will look.
  2. Prioritizing Decisions: Because a consequential token (a choice point) is predicted from many different earlier positions, its error appears repeatedly in the loss calculation. Training therefore implicitly assigns higher weight to learning these crucial decisions, and the model concentrates on the most critical transitions, which leads to better planning.
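A small counting sketch of this implicit reweighting (a simplified model, not DeepSeek's code): with a prediction depth of D, every token in the interior of the sequence is scored as a target from D different earlier positions, so a hard-to-predict choice point contributes its error D times per pass instead of once.

```python
from collections import Counter

def prediction_coverage(num_tokens, depth):
    # Count how many (input position, offset) prediction pairs
    # target each token index in the sequence.
    counts = Counter()
    for i in range(num_tokens):
        for d in range(1, depth + 1):
            if i + d < num_tokens:
                counts[i + d] += 1
    return counts

# With depth=3, an interior token is a target 3 times per pass;
# with depth=1 (single-token prediction), only once.
```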

D. Faster Inference Speed

MTP can also lead to faster prediction (inference).

Since the model is trained to predict multiple tokens at once, the extra prediction heads can be used to draft several tokens per step (a form of speculative decoding), speeding up inference by up to three times.

Important Note for DeepSeek: While MTP helps make inference faster, DeepSeek V3 mainly used MTP during pre-training to get the benefits of richer signals and better planning. During the final prediction stage (inference), DeepSeek typically discarded the MTP modules and predicted one token at a time.
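When the extra heads are kept, the common recipe is draft-and-verify: the cheap MTP heads propose several tokens, and the main model verifies them in a single pass. A highly simplified sketch of the accept step (hypothetical helper name; real implementations compare probabilities, not just greedy token matches):

```python
def accept_drafted(drafted, verified):
    # Accept the longest prefix where the cheap draft agrees with the
    # full model's (greedy) choice; at the first mismatch, keep the
    # verifier's token and discard the rest of the draft.
    accepted = []
    for d, v in zip(drafted, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)
            break
    return accepted

# If all 3 drafted tokens are accepted, one verification pass emits
# 3 tokens instead of 1 -- the source of the up-to-3x speedup.
```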

Next Steps

Now you know the "why" behind DeepSeek's powerful Multi-Token Prediction strategy. It makes the model smarter, more efficient, and better at planning the future sequence of tokens.