Introduction: Why Causal Attention Matters
Hello, future AI engineers! In this part of our series, we will learn about Causal Attention. This is a necessary step to understand the advanced DeepSeek architecture, especially its Multi-Head Latent Attention mechanism.
If you want to build large and smart AI systems, you must understand these core parts. Causal Attention is a key building block. It helps our AI model think clearly and make accurate predictions, which is essential for any language model.
1. The Core Idea: You Cannot Look Ahead
Large Language Models (LLMs) work by predicting the next word, or token, in a sequence. For example, if the input is "Mr and Mrs Dudley," the model must predict the next token, "off".
To make this prediction, the model must only use the information that comes before the word it is predicting. It cannot "cheat" by looking at tokens that come after the current position.
- Example: If we are trying to predict the word "Dudley," the input must only be "Mr and Mrs".
- The Rule: For every output, the input must only include tokens that occurred previously.
This rule is the whole idea behind Causal Attention. Causal Attention is a special kind of Self-Attention that follows this strict rule. It is also called Masked Attention.
2. Causal Attention vs. Self-Attention
Self-Attention allows a token to find relationships with all other tokens in the input sequence—both those that came before it and those that come after it.
When we compute the attention scores, we create a matrix. In a standard Self-Attention matrix, scores are calculated for every token combination. The scores found above the main diagonal in this matrix relate a token to the tokens that come after it.
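As a concrete illustration (a minimal PyTorch sketch with toy random values, not tied to any specific model), the score matrix for a 4-token sequence is 4×4, and the entries above the main diagonal are exactly the "future" positions:

```python
import torch

torch.manual_seed(0)
seq_len, d = 4, 3                      # toy sizes, chosen for illustration
queries = torch.randn(seq_len, d)
keys = torch.randn(seq_len, d)

# One score per (query token, key token) pair: a 4x4 matrix.
scores = queries @ keys.T

# True above the main diagonal, i.e. where a token looks at later tokens.
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(scores.shape)   # torch.Size([4, 4])
print(future)
```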
However, if we are predicting the next token, these future scores are not useful because we do not have access to the future information.
The solution in Causal Attention is simple: we must prevent the model from using these future scores. We do this by masking them out.
3. Implementation: Using Negative Infinity
The goal is to make the attention weights above the diagonal (the "future" positions) end up as zero, so the model cannot attend to tokens it has not yet seen.
A straightforward way is to calculate the attention scores, apply the SoftMax function to get the attention weights, set the future weights to zero, and then re-normalize each row so it sums to one again. This works, but it requires two stages of normalization.
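The two-stage approach can be sketched as follows (a minimal PyTorch illustration with toy random scores):

```python
import torch

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(seq_len, seq_len)   # raw attention scores (Q @ K.T)

# Stage 1: normalize all scores with SoftMax.
weights = torch.softmax(scores, dim=-1)

# Zero out the "future" weights above the diagonal.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
weights = weights.masked_fill(mask, 0.0)

# Stage 2: re-normalize each row so it sums to one again.
weights = weights / weights.sum(dim=-1, keepdim=True)
print(weights.sum(dim=-1))   # each row sums to 1
```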
A more efficient method only needs one SoftMax step.
The Negative Infinity Trick
- Start with Attention Scores: We calculate the attention scores (Queries multiplied by the transpose of the Keys, $QK^T$).
- Apply the Mask: We replace all the future attention scores (the elements above the diagonal) with negative infinity ($\mathbf{-\infty}$).
- Apply SoftMax: We apply the SoftMax function once.
Why this works: The SoftMax function exponentiates each score before normalizing. Since $e^{-\infty}$ evaluates to zero, the masked positions contribute nothing.
By replacing future scores with negative infinity, the SoftMax function automatically makes those attention weights zero. Crucially, SoftMax also ensures that the weights in each row sum up to one, completing the normalization in a single step. This single-step process is much more computationally efficient.
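The single-SoftMax trick can be sketched like this in PyTorch (a minimal illustration with toy random scores):

```python
import torch

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(seq_len, seq_len)   # raw attention scores (Q @ K.T)

# Replace the future scores (above the diagonal) with negative infinity...
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
masked_scores = scores.masked_fill(mask, float("-inf"))

# ...then a single SoftMax: e^(-inf) = 0, and each row still sums to one.
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```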
4. Enhancing Generalization with Dropout
After calculating the causal attention weights, we often add another mechanism called Dropout.
Dropout is a technique used during training to prevent the model from learning too much from the training data, which can lead to poor performance on new data (overfitting).
How Dropout Works:
- Dropout randomly sets a certain percentage of the attention weights to zero during each training step.
- If the dropout rate is 50%, about half of the weights in every row are randomly set to zero. (Standard implementations also scale the surviving weights by $1/(1-p)$, so the expected row sum stays the same.)
- This prevents a few parameters (neurons) from doing all the work while others stay "lazy".
- By randomly disabling some weights, Dropout forces all weights to contribute and learn effectively, which improves the model's ability to handle new information (generalization).
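Applied to attention weights, dropout might look like this (a PyTorch sketch; `p=0.5` matches the 50% example above). Note that PyTorch's `nn.Dropout` scales the surviving values by $1/(1-p)$, and is only active in training mode:

```python
import torch

torch.manual_seed(0)
weights = torch.softmax(torch.randn(4, 4), dim=-1)   # toy attention weights

dropout = torch.nn.Dropout(p=0.5)   # 50% dropout rate
dropout.train()                     # dropout is active only during training
dropped = dropout(weights)

# Surviving weights are scaled by 1/(1-p) = 2, so the expected sum is unchanged.
print(dropped)

dropout.eval()                      # at inference time, dropout is a no-op
assert torch.equal(dropout(weights), weights)
```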
Summary of Causal Attention Changes
Causal Attention takes the Input Embedding Vector and converts it into a richer Context Vector by incorporating neighboring word information, just like Self-Attention. The difference lies in how it handles future tokens.
The main steps that change from basic Self-Attention are:
- Masking: We replace the attention scores for future tokens with $-\infty$.
- SoftMax: We apply SoftMax once, which automatically sets the masked scores to zero and normalizes the weights.
- Dropout: We randomly turn off some attention weights to ensure robust learning and better generalization.
This process creates an efficient and focused attention mechanism that is crucial for models like DeepSeek, which are built to predict tokens sequentially.
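Putting the three changes together, a minimal single-head causal attention module might look like the sketch below (an illustration, not any model's actual implementation; it also applies the standard $1/\sqrt{d_k}$ scaling of the scores from scaled dot-product attention, and the class and parameter names are made up for this example):

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    """Minimal single-head causal attention: mask, one SoftMax, dropout."""

    def __init__(self, d_in, d_out, max_len, dropout=0.1):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Precompute the causal mask for sequences up to max_len tokens.
        mask = torch.triu(torch.ones(max_len, max_len), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # x: (batch, seq_len, d_in)
        seq_len = x.shape[1]
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        # Masking: future scores become -inf, so SoftMax turns them into 0.
        scores = scores.masked_fill(self.mask[:seq_len, :seq_len], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        return weights @ v   # context vectors: (batch, seq_len, d_out)

attn = CausalAttention(d_in=8, d_out=8, max_len=16)
out = attn(torch.randn(2, 5, 8))
print(out.shape)   # torch.Size([2, 5, 8])
```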
(Continue to next part for Multi-Head Attention, the next step toward DeepSeek's MLA.)