Build DeepSeek from scratch - Part 1: The Foundation

December 13, 2025

Salam! I’m so excited to start this series with you. We are going to explore DeepSeek, the amazing Chinese company that is making waves in the world of Artificial Intelligence. If you've ever used ChatGPT, you have already interacted with what we call a Large Language Model (LLM). DeepSeek builds powerful models just like that.

What exactly is an LLM?

It might look like magic when an AI writes you a travel plan for Italy, but at its heart, an LLM is a very smart prediction machine.

Think of it this way: The LLM is an engine that takes a sequence of words and then calculates the probability—the chance—of what the next word, or 'token', should be.

For example, if you give it the start of a sentence: “After years of hard work, your effort will take you...” The model looks at all the possible next words (like 'to', 'far', 'places', 'where') and assigns a chance to each one. Usually, it picks the word with the highest chance, like "to".
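To make this concrete, here is a tiny sketch in plain Python. The candidate words and their raw scores ("logits") are made up for illustration; a real model scores its entire vocabulary, often 50,000+ tokens.

```python
import math

# Hypothetical raw scores ("logits") a model might assign to candidate next words.
# These numbers are invented purely for illustration.
logits = {"to": 4.1, "far": 2.3, "places": 1.7, "where": 0.9}

# The softmax function turns raw scores into probabilities that sum to 1.
total = sum(math.exp(score) for score in logits.values())
probs = {word: math.exp(score) / total for word, score in logits.items()}

for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word!r}: {p:.1%}")

# 'to' gets the highest probability, so a greedy decoder would pick it.
```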

It is very important to remember that even if the AI answers with high confidence, underneath, it is all based on probability for every single token it produces. When you put all those highly probable tokens together, you get the long, coherent sentences we see.
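If you want to watch that token-by-token chaining happen with a real (small, open) model, here is a minimal sketch using Hugging Face's transformers library. GPT-2 is just a stand-in here, not DeepSeek's model, but the greedy generation loop is the same idea.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small open model used purely as a stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("After years of hard work, your effort will take you",
                return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                    # generate 10 tokens, one at a time
        logits = model(ids).logits         # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()   # greedy: take the most probable token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```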

The Secret Power: Why 'Large' is so Important

Why do we keep saying Large Language Models? Because size is the key to their incredible performance!

There is something called a scaling law. It shows that a model's performance improves predictably as you scale up its size (along with its training data and compute). For example, when models went from 1.5 billion parameters (the largest GPT-2) up to 175 billion parameters (GPT-3), we truly saw a huge jump in capability.
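The original scaling-law paper (Kaplan et al., 2020) even fit a simple power law for how loss falls as parameter count grows. Here is a rough sketch of that curve; the constants are the paper's reported fit, so treat them as ballpark values rather than exact predictions.

```python
# Power-law fit from Kaplan et al. (2020): loss(N) = (N_c / N) ** alpha_N.
# The constants below are the paper's reported fit; treat them as ballpark figures.
N_C = 8.8e13      # fitted constant (in parameters)
ALPHA_N = 0.076   # fitted exponent

def predicted_loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA_N

# GPT-2 (1.5B), GPT-3 (175B), and a hypothetical 1T-parameter model
for n in [1.5e9, 175e9, 1e12]:
    print(f"{n:.1e} params -> predicted loss {predicted_loss(n):.2f} (lower is better)")
```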

As these language models grow, they develop emergent properties: surprising abilities that smaller models simply do not have. Imagine a small car suddenly being able to fly!

These emergent properties mean that once the model size crosses a certain point, it suddenly starts learning things like performing arithmetic, doing translation, summarizing texts, or checking grammar. This ability to handle a wide range of tasks—unlike earlier models that were built for only one task like translation—is why LLMs are so powerful and why everyone is racing to build models with a trillion parameters.

Building a Giant: The Two Essential Stages

If you want to create a powerful LLM, you need to follow two critical steps:

1. Pre-training (The Foundation)

This first stage is where we create what is called the foundation model (you will also hear it called the base model).

To do this, we need huge amounts of text data. Think of assembling information from everything available: the internet, textbooks, Wikipedia articles, and research papers. The model learns its basic knowledge by doing one simple thing over and over across all that text: predicting the next token.
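Under the hood, "learning basic knowledge" is just the next-token prediction we met earlier, scored with a cross-entropy loss. Here is a shapes-only sketch in PyTorch; the random logits stand in for what a real transformer would output.

```python
import torch
import torch.nn.functional as F

# Toy illustration of the pre-training objective: predict each next token.
# Shapes only; random logits stand in for a real transformer's output.
vocab_size, seq_len = 50_000, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))   # a chunk of text, as token IDs
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model(tokens)

# Next-token prediction: the output at position t is trained to predict token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1),               # targets are the tokens shifted by one
)
print(loss.item())
```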

But this part is seriously expensive! Training these foundational models can cost millions, and sometimes hundreds of millions of dollars.
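Why so expensive? A common back-of-envelope rule estimates training compute as roughly 6 x N x D floating-point operations, where N is the parameter count and D the number of training tokens. Here is a rough sketch for a GPT-3-sized run; the GPU speed, utilization, and hourly price are my assumptions, not reported figures.

```python
# Back-of-envelope training cost using the common ~6 * N * D FLOPs rule of thumb.
N = 175e9            # parameters (GPT-3-sized)
D = 300e9            # training tokens (roughly GPT-3's reported count)
total_flops = 6 * N * D

# Assumptions (not reported figures): A100 peak ~312 TFLOPs, ~30% utilization, $2/GPU-hour.
gpu_peak_flops = 312e12
utilization = 0.30
gpu_hours = total_flops / (gpu_peak_flops * utilization) / 3600

print(f"~{total_flops:.2e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, ~${2 * gpu_hours:,.0f}")
```

Even with these optimistic assumptions, a single run lands in the millions of dollars, and that is before counting failed runs, experiments, and the cluster itself.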

2. Fine-tuning (Making it a Genius)

After the model has basic capabilities from pre-training, we move to fine-tuning. This stage uses labeled data to teach the model to be useful and follow specific instructions.

For example, to teach the model how to follow instructions, we might give it the instruction, "Convert 45 kilometers to meters," and provide the correct answer, "45,000 meters." This feedback helps the model learn. For models like GPT-3.5 (which became ChatGPT), human annotators even graded the outputs, and that feedback was fed back into the model. This is called reinforcement learning from human feedback (RLHF).
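Here is what a single instruction-tuning example might look like. The field names and prompt template below are illustrative, not any particular dataset's schema.

```python
# A minimal sketch of one supervised fine-tuning (instruction-tuning) example.
# Field names and template are illustrative, not a specific dataset's schema.
example = {
    "instruction": "Convert 45 kilometers to meters.",
    "response": "45 kilometers is 45,000 meters.",
}

# During fine-tuning, the model sees the instruction and is trained, with the
# same next-token loss as pre-training, to produce the response. Typically the
# loss is computed only on the response tokens, so the model learns to answer
# rather than to repeat the instruction.
prompt = f"Instruction: {example['instruction']}\nResponse: "
print(prompt + example["response"])
```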

DeepSeek made massive changes to this fine-tuning stage, which is why their models, especially DeepSeek R1, are so special. We will definitely dive into that later!

Understanding these basics is the perfect starting point. The LLMs DeepSeek builds follow these same rules, and next time we will see how their specific models, like the 671-billion-parameter DeepSeek V3 and the game-changing DeepSeek R1, became so popular!


It is helpful to view the creation of an LLM like baking a massive, complex cake. The pre-training stage is when you gather all the ingredients (data) and mix them together in a giant bowl. You end up with a huge, plain cake (the foundation model with its basic capabilities). The fine-tuning stage is where you add the frosting and decorations (labeled data and human feedback) to make it ready for a specific event, turning the basic cake into a perfect, specialized masterpiece!