Salam! I’m so happy you’re here for the first deep dive into the awesome world of DeepSeek and Large Language Models (LLMs). Before we talk about DeepSeek’s cool tricks, we have to understand the heart of any LLM: its architecture.

Think of an LLM like a super complex car engine. Its job is simple: take a sequence of words (the "fuel") and predict the very next word (the "motion," or output). Let's open this engine and see how its billions of parts (called parameters) work together. The entire structure of an LLM can be divided into three main parts: the Input, the Processor (the magic), and the Output.

## The Input: Getting Your Uniform Ready

Imagine you are one word; let's call you a token. You are about to start a long, challenging journey! We will follow the word "friend" from the sentence: “A true friend accepts you.” This first stage is like getting ready for a big school trip where you must have the correct uniform and ID.
- Isolation and the ID Badge: First, you are separated from your neighbors ("a," "true," "accepts," "you"). Now that you are alone, you get your own unique number, like a roll number in class, called a token ID. This ID comes from a massive book of tokens (the vocabulary) that lists every possible character, subword (like 'ation'), or full word. For our word "friend," the badge number might be 20112.
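If you want to see these ID badges for yourself, here is a minimal sketch using the open-source `tiktoken` library with its GPT-2 vocabulary. (The 20112 above is just an illustrative number; the real IDs depend on the tokenizer.)

```python
# Minimal tokenization sketch (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")        # GPT-2's vocabulary (~50,257 tokens)
sentence = "A true friend accepts you."

token_ids = enc.encode(sentence)                    # each token gets its ID badge
tokens = [enc.decode([tid]) for tid in token_ids]   # peek at the tokens themselves
print(list(zip(tokens, token_ids)))
```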
- The First Quiz (Token Embedding, the Meaning): Next, you are asked a huge set of questions (maybe 768 of them!). This is your first test, called Token Embedding. These questions check your meaning: Are you a noun? Are you related to emotion? Are you a sport? Nobody writes these questions down explicitly; the model learns the answers during training, capturing the semantic meaning of the word. Your answers are collected into a vector of 768 values. This is a crucial step, because LLMs need to extract the meaning of the language.
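In code, this "quiz" is just a big lookup table. Here's a toy sketch in PyTorch, using this article's numbers (a 50,000-token vocabulary, 768 answers per token). In a real model the table's values are learned during training; here they start out random.

```python
# A toy token-embedding table: one row of 768 "answers" per vocabulary entry.
import torch

vocab_size, d_model = 50_000, 768
token_embedding = torch.nn.Embedding(vocab_size, d_model)

friend_id = torch.tensor([20112])       # our illustrative ID badge for "friend"
meaning = token_embedding(friend_id)    # the 768 quiz answers for "friend"
print(meaning.shape)                    # torch.Size([1, 768])
```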
- The Second Quiz (Positional Embedding, the Location): The next test checks your position in the sentence. Why? Because in the sentence "The dog chased another dog," the role of the first "dog" is different from that of the second "dog," based purely on where it sits. So you are asked another 768 questions, this time about your position (e.g., Are you at the beginning? Are you in the middle?).
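Sticking with our toy PyTorch sketch: GPT-2-style models learn a second lookup table, one row per position, and (as the next stop explains) the two quiz results simply get added together. The context length of 1,024 is GPT-2 Small's; treat it as an assumption here.

```python
# A toy positional-embedding table: one row of 768 answers per position.
import torch

context_length, d_model = 1024, 768     # GPT-2 Small's context length (assumed)
pos_embedding = torch.nn.Embedding(context_length, d_model)

position = torch.tensor([2])            # "friend" is the 3rd token (index 2)
location = pos_embedding(position)      # 768 answers about where you sit
# uniform = meaning + location          # token + positional = input embedding
print(location.shape)                   # torch.Size([1, 768])
```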
- The Uniform (Input Embedding): Finally, the results from your two quizzes (Token Embedding + Positional Embedding) are added together. This combined 768-dimensional vector is your new, unique Input Embedding: your uniform! Every other word in the sentence gets a different uniform, because its meaning or position (or both) will be different.

## The Processor: Riding the Transformer Train

With your uniform on, you are ready to board the most important part: the Transformer Block. This is where all the "magic" happens, enabling LLMs to summarize, check grammar, and even code. The Transformer Block is like a long train. But here’s the tough part: for models like GPT-2 Small, this train has 12 connected blocks, and you, as the token, must go through all of them! Inside each block, there are five main stops (compartments) you visit (a code sketch of one full block follows the list):
- Normalization: Your uniform's values are rescaled (normalized) so they stay in a stable range that is easier to train.
- Multi-Head Attention (MHA): This is super important! Here, you learn context. If you are the word "friend," MHA figures out how much attention you should pay to "true" or "accepts" to understand the sentence's meaning. (DeepSeek made great innovations in this exact area, which we will see later!).
- Feed Forward Neural Network (FFNN): Your 768-dimensional uniform is expanded into a much higher-dimensional space (4 times bigger, to 3,072 dimensions) and then compressed back down. This expand-and-compress step gives the model extra capacity to capture complex patterns, and it holds a large share of the parameters. (DeepSeek's other big innovation, Mixture of Experts (MoE), happens right here).
- Dropout Layers: During training, some neurons are randomly switched off for a moment, so the rest can't get "lazy" and over-rely on their teammates (this helps prevent overfitting).
- Skip Connections: These are like shortcuts that let information flow smoothly through the whole engine without getting lost (avoiding the vanishing gradient problem).

You go through this intense sequence 12, 24, or even 48 times, depending on the model size. Even after this difficult journey, your uniform remains the same size: 768 dimensions. The values in the vector have changed, but the size hasn't.
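To make the train concrete, here is a minimal, GPT-2-flavoured sketch of one block in PyTorch. It is a simplified illustration, not a faithful reimplementation: real blocks add causal masking, careful initialization, and other details.

```python
# One simplified GPT-2-style Transformer block, showing the stops above.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)              # Normalization
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )                                               # Multi-Head Attention
        self.norm2 = nn.LayerNorm(d_model)              # Normalization (again)
        self.ffnn = nn.Sequential(                      # FFNN: expand 4x, compress
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.drop = nn.Dropout(dropout)                 # Dropout

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                # attend to the other tokens
        x = x + self.drop(attn_out)                     # Skip connection
        x = x + self.drop(self.ffnn(self.norm2(x)))     # Skip connection again
        return x                                        # still 768-dimensional!

# Ride through 12 connected blocks, GPT-2 Small style (weights untrained here;
# the point is that the vector's size never changes).
blocks = nn.ModuleList(TransformerBlock() for _ in range(12))
x = torch.randn(1, 5, 768)                              # 5 tokens in their uniforms
for block in blocks:
    x = block(x)
print(x.shape)                                          # torch.Size([1, 5, 768])
```

Notice that normalization actually appears twice per block (once before attention, once before the FFNN), which is why your uniform keeps getting adjusted along the way.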
## The Output: Making the Big Choice

Finally, you reach the output layer, ready to help the LLM predict the next word.
- The Big Conversion: The model must convert your 768-dimensional vector into a vector the size of the whole vocabulary (which we assumed was 50,000). This happens in the final projection layer. Now your vector has 50,000 dimensions, one score per vocabulary entry.
- The Prediction: Since the 50,000 dimensions represent every possible word in the vocabulary, the model looks for the index with the highest value, or, after a softmax, the highest probability. It picks this highest-probability word and declares it the "next token." This process happens for every word in the sentence, creating many input-output prediction tasks, which is how the model learns the language itself. The fact that the model learns language is actually a byproduct of this next-token prediction task.

## A Note on Probability

The fact that the LLM picks the word with the highest probability is very important. When a simple program was run against the model's API, it showed that while the model might select "to" as the next word for a sentence, it still assigned some probability (a chance) to other tokens like "far," "places," and "where." The exact script isn't reproduced here, but its purpose was clear: to show that the AI is a probabilistic engine, making a highly educated guess by taking the token with the highest assigned probability.

| Output Token | Result |
| --- | --- |
| 'to' | Chosen (highest probability) |
| 'far' | Lower probability |
| 'places' | Lower probability |
| 'where' | Lower probability |
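To make the output stage concrete, here is a toy PyTorch sketch of the final projection plus a softmax. The 4-way top-k mirrors the little table above, though with random, untrained weights the actual IDs and numbers are meaningless.

```python
# The final stop: 768 dims -> 50,000 vocabulary scores -> probabilities.
import torch

d_model, vocab_size = 768, 50_000
projection = torch.nn.Linear(d_model, vocab_size)   # the final projection layer

x = torch.randn(1, d_model)                # the token's vector after 12 blocks
logits = projection(x)                     # 50,000 raw scores, one per token
probs = torch.softmax(logits, dim=-1)      # scores -> probabilities (sum to 1)

top = torch.topk(probs, k=4)               # the four most likely next tokens
print("chosen token id:", top.indices[0, 0].item())   # highest probability wins
print("runner-up probabilities:", top.values[0, 1:])
```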
This entire, complicated process, from getting the ID badge to the final prediction, involves training and optimizing all the billions of parameters hidden inside the token embedding, the positional embedding, the attention layers, and the feed-forward networks. That optimization is what makes the LLM so smart!

Understanding the journey of the token is like knowing how your Kabyle carpet is made. It looks simple and beautiful when finished, but beneath the surface there are hundreds of thousands of threads (the parameters) being processed, sorted, and woven together repeatedly (the Transformer Blocks) until the final, coherent pattern (the next predicted word) appears.