Build DeepSeek from scratch - Part 3: The Attention Mechanism

December 13, 2025

Salam! I’m so excited that we are continuing our journey into the brains of Large Language Models (LLMs) like those built by DeepSeek. In our last chat, we followed a single word, a token, as it got its ID badge and traveled through the massive Transformer Block.

Today, we are going to focus on the most important stop on that Transformer train: The Attention Mechanism. This is the key component that makes LLMs so good at understanding language.

A Look Back: When AI Had No Memory

To appreciate the attention mechanism, we need a quick history lesson. Long ago, AI was not very smart:

  1. ELIZA (1960s): This was one of the first chatbots, and it acted like a therapist. If you told it, "I am having a hard time learning AI," it might answer, "You believe it is normal to be having a hard time learning AI?" It was revolutionary at the time, but not very helpful compared to today's GPT-4.
  2. RNNs and LSTMs (1980s and 1997): These were the next big steps. They solved a major problem: earlier neural networks had no memory of previous inputs. Recurrent Neural Networks (RNNs) used "hidden states" to capture the past, meaning the current word's state depended on the previous word's state (see the sketch just after this list); Long Short-Term Memory networks (LSTMs) later refined this so information could survive over longer stretches of text.
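
In symbols (a generic sketch, not any one specific RNN variant): each new hidden state is $h_t = f(W_h h_{t-1} + W_x x_t)$, so $h_t$ folds the entire past into itself through $h_{t-1}$, one word at a time.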

The Context Bottleneck: One Vector for Too Much Information

RNNs and LSTMs were okay for short sentences, especially for simple tasks like sequence-to-sequence language translation (e.g., English to French).

But researchers quickly found a massive issue once the text got long: the Context Bottleneck Problem.

Imagine you are translating a huge paragraph. In an RNN setup, all the context—everything that happened in the whole paragraph—is compressed into just one final hidden state (one vector).

It’s like trying to remember an entire 50-page textbook for your exam, but you are only allowed to write a one-sentence summary. It is impossible! You lose most of the important details, and the decoder block (the part that translates or generates the output) only receives that single, overloaded vector.

To solve this, we needed a revolutionary idea: the model shouldn't rely only on that final vector. It needed to be able to selectively access parts of the input sequence during decoding.

The Birth of Attention

The solution was Attention. Attention simply means calculating the relative importance that should be given to different tokens when generating the output.

The first paper to put this idea into practice introduced the Bahdanau attention mechanism in 2014. It proved that you could dramatically improve translation if you let the decoder look back at all the hidden states of the encoder, not just the last one.
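
In rough math (a sketch of the core idea from that paper): for every output step $t$, the decoder builds a fresh context vector $c_t$ as a weighted sum of all encoder hidden states $h_i$, with weights $\alpha_{t,i}$ coming from a softmax over learned alignment scores $e_{t,i}$:

$$
c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i,
\qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}
$$

No single vector has to carry the whole paragraph anymore; every output step gets its own custom mix of the input.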

Then, in 2017, the famous "Attention Is All You Need" paper came out. Researchers realized they didn't even need the old RNN structure anymore. They completely scrapped the RNNs and introduced the Transformer architecture, which placed the Attention Mechanism right at the heart of the engine.

Self-Attention: The LLM's Secret Weapon

For modern LLMs like GPT or DeepSeek, the model is built to do one thing: predict the next token. To do this, we use something called Self-Attention.

Unlike the first attention (which was between two different sequences, like English and French), Self-Attention is attention within the same sequence.

The main purpose of Self-Attention is to see how different words relate to each other. Take the sentence "I am from Pune, India. I speak...": when the model processes the word "speak," it needs to pay maximum attention to "Pune" and "India," because the language you are likely to speak is shaped by where you are from. This is how LLMs get context!

From Uniform to Context Vector

The whole goal of the attention mechanism is to enrich the information carried by each token.

  1. Input Embedding: When a token enters the Transformer, it has its Input Embedding (its "uniform"), a vector of 768 dimensions (in GPT-2-sized models; the exact size varies by model). This vector knows the token's meaning and position, but it has no information about its neighbors.
  2. Context Vector: After passing through the Self-Attention block, the token comes out as a Context Vector. This vector is an enriched version of the uniform because it now contains information about all its neighbors and how important they are.
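
To make that concrete, here is a minimal sketch of simplified self-attention with no trainable weights yet. I am assuming PyTorch here, and the tensor sizes are toy values chosen for readability (real embeddings are 768-dimensional or more):

```python
import torch

# Toy "input embeddings": 4 tokens, each a 6-dimensional vector.
torch.manual_seed(0)
inputs = torch.randn(4, 6)               # (num_tokens, embed_dim)

# Step 1: attention scores = dot product of every token with every other token.
scores = inputs @ inputs.T               # (4, 4)

# Step 2: normalize each row into attention weights that sum to 1.
weights = torch.softmax(scores, dim=-1)  # (4, 4)

# Step 3: each context vector is a weighted sum of ALL the input embeddings.
context = weights @ inputs               # (4, 6): one enriched vector per token

print(weights[0].sum())   # ~1.0 -- each row of attention weights sums to 1
print(context.shape)      # torch.Size([4, 6])
```

Each row of `context` is the enriched "uniform" for one token: the same size as before, but now blended with information from every neighbor in proportion to its attention weight.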

The Contextual Problem: Why Simple Math Fails

To find the relationship (the importance, or "attention score") between tokens, the simplest thing to do would be to calculate a Dot Product between the vectors. The dot product measures semantic similarity: if two word vectors point in a similar direction in embedding space, the score is high.

But simple similarity isn't enough when context is involved.

Example: Look at this tricky sentence: "The dog chased the ball but it couldn't catch it."

When the model looks at the word "it" (the second 'it' in the sentence), it should pay more attention to "ball," because you catch a ball, not a dog.

If you used a simple dot product between the input embedding for 'it' and the input embeddings for 'dog' and 'ball,' the attention scores might come out identical (e.g., both 0.51). This is a disaster: the model cannot distinguish the subtle contextual relationship, so it fails to see that "catch" refers to the "ball".
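
Here is a tiny demonstration with hand-picked 3-dimensional vectors (the numbers are purely illustrative; I chose them to reproduce the 0.51 tie, whereas real embeddings are learned, not hand-written):

```python
import torch

# Hand-picked toy embeddings in which 'dog' and 'ball' happen to be
# equally similar to 'it' -- a raw dot product cannot tell them apart.
it   = torch.tensor([0.45, 0.45, 0.20])
dog  = torch.tensor([0.80, 0.20, 0.30])
ball = torch.tensor([0.20, 0.80, 0.30])

print(torch.dot(it, dog))   # tensor(0.5100)
print(torch.dot(it, ball))  # tensor(0.5100) -- identical score, zero context!
```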

The Deep Learning Trick: Queries, Keys, and Values

Since humans couldn't write a perfect mathematical rule to capture this linguistic complexity, researchers used the main trick of deep learning: if you can't figure out the relationship yourself, replace it with trainable matrices and let the neural network learn it!

This is why we introduce three new concepts:

  1. Query (Q): This is the token we are currently looking at (the "asker," like the word "it").
  2. Key (K): These are all the other tokens in the sentence (the things the Query is looking at, like "dog" and "ball").
  3. Value (V): The actual information each token contributes to the output. (We will cover this fully next time, in sha'Allah!)

The simple input embeddings are now multiplied by trainable weight matrices (WQ and WK), initialized randomly, to project them into a new, learned space (the dimension of that space is a design choice).

Now, when the attention scores are calculated in this new transformed space, the model can learn the context.
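
Here is a minimal sketch of that transformation, reusing the toy vectors from above. The weights are random and untrained, so the printed scores are arbitrary; the point is that the two scores are no longer forced to be equal, and training nudges WQ and WK until the useful one wins:

```python
import torch

torch.manual_seed(42)
embed_dim, qk_dim = 3, 4   # toy sizes; the query/key dimension is a design choice

# Trainable projection matrices: random at first, learned during training.
W_Q = torch.nn.Parameter(torch.randn(embed_dim, qk_dim))
W_K = torch.nn.Parameter(torch.randn(embed_dim, qk_dim))

# The same hand-picked embeddings that tied at 0.51 before.
it   = torch.tensor([0.45, 0.45, 0.20])
dog  = torch.tensor([0.80, 0.20, 0.30])
ball = torch.tensor([0.20, 0.80, 0.30])

query    = it @ W_Q    # 'it' becomes the asker
key_dog  = dog @ W_K   # 'dog' becomes a candidate answer
key_ball = ball @ W_K  # 'ball' becomes a candidate answer

# In the transformed space the scores differ -- and they are trainable.
print(torch.dot(query, key_dog))
print(torch.dot(query, key_ball))
```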

For our example:

| Query (It) vs. Key | Simple Dot Product | Transformed Dot Product (Using WQ/WK) | Result |
| :--- | :--- | :--- | :--- |
| It vs. Dog | 0.51 (Example) | 0.56 | Low attention |
| It vs. Ball | 0.51 (Example) | 0.96 | High attention |

By adding these trainable matrices, we have more knobs to play around with. The training process figures out the best values for WQ and WK so that the attention score between 'it' and 'ball' is much higher, accurately capturing the context. This is why Attention is such a game-changer!