Why Build Language Models From Scratch?
Hello, future AI engineers! Building a language model like Gemma 3 270M (a model with 270 million parameters) from start to finish is deeply satisfying. When you build it yourself, you understand every part of the system.
Many models are called "open source," but they often share only the weights (the trained numbers inside the model). They do not show the full process, from gathering the data to the final training run.
Our goal here is to learn the entire process:
- Get the data.
- Convert the text into a format the computer understands.
- Prepare the input and output pairs.
- Build the Gemma architecture.
- Run the training process (pre-training).
This knowledge is important for you, as computer science students, because it helps democratize AI knowledge and shows you how to build a complete system. Gemma 3 270M is considered a Small Language Model (SLM). Even though it is smaller than models like GPT-4 (which may have a trillion parameters), building it helps us master the core steps.
Step 1: Data Set Assembly
The first step is choosing the data. To train a small model efficiently, we need a specific data set.
We will use the TinyStories data set, which contains stories simple enough for 3- to 4-year-old children to understand. This is an important choice: focused data helps the model learn the structure and nuances of the English language even when the model is very small (around 270 million parameters). We do not need all the knowledge in the world for the model to learn language rules.
This data set has about 2 million stories (or "rows") for training. We load it with the Hugging Face `datasets` library.
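A minimal loading sketch, assuming the Hugging Face `datasets` package and the public `roneneldan/TinyStories` dataset identifier (the exact source used in a given run may differ):

```python
# A possible way to load TinyStories; assumes the Hugging Face `datasets`
# package and the public "roneneldan/TinyStories" dataset identifier.
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories")

print(dataset)                         # shows the train/validation splits and row counts
print(dataset["train"][0]["text"])     # first story as plain text
```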
Step 2: Data Tokenization: Turning Text into Numbers
The computer cannot understand text directly, so we must convert words into a numerical format. This process is called tokenization.
Why Not Character or Word Tokenization?
Choosing the right tokenization method solves major problems:
| Method | How it Works | Problem Solved by Gemma's Method |
| :--- | :--- | :--- |
| Character Level | Breaks text into single letters (e.g., 'C', 'A', 'T'). | This leads to a "ballooning effect": if every character is a token, one story becomes very long and might not fit inside the model's context window (the memory limit). It also destroys the meaning carried by whole words. |
| Word Based | Breaks text into full words (e.g., 'cat', 'house'). | This creates an Out-of-Vocabulary (OOV) problem: if the model sees a new word, a misspelled word, or technical jargon that is not in its stored vocabulary, it cannot process it. The vocabulary also becomes very large, making training expensive. |
Subword Tokenization (BPE)
We use Subword Tokenization, also known as Byte Pair Encoding (BPE), because it combines the strengths of both approaches.
BPE starts with characters, but then it merges characters that appear together often (like 'ing' or 'ize') into one token.
- Low Vocabulary Size: We do not need to save every single word in the language, reducing computational cost.
- No OOV Problem: If a word is new, BPE can break it down into smaller, known subwords or characters.
- No Context Problem: Because tokens are subwords (not single characters), the sequence length is shorter than in character-level tokenization.
We use a vocabulary size of 50,257 for our implementation.
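As a concrete illustration, the sketch below assumes the GPT-2 BPE tokenizer from the `tiktoken` package, whose vocabulary size is exactly 50,257:

```python
# Sketch of subword (BPE) tokenization, assuming the GPT-2 tokenizer from
# `tiktoken`, whose vocabulary size is exactly 50,257.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                    # 50257

ids = enc.encode_ordinary("Once upon a time there was a little cat.")
print(ids)                            # list of token IDs (subwords, not characters)
print(enc.decode(ids))                # round-trips back to the original text
```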
Storing Token IDs
After tokenizing all the stories, we must store the resulting numbers (token IDs). We store all tokens together in large files, called train.bin and validation.bin.
We save these files directly onto the disk (using memory-mapping) instead of keeping everything in RAM. This is a smart strategy because it allows for fast data loading during the training process and prevents the computer's memory from overloading when dealing with large amounts of data.
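A minimal sketch of this storage strategy; it assumes the token IDs fit in 16-bit unsigned integers (they do, since 50,257 < 65,536) and uses the train.bin file name from above:

```python
# Sketch of storing token IDs on disk as one flat binary file, so training
# can later memory-map it instead of holding everything in RAM.
# Assumes `all_ids` holds the token IDs for every training story.
import numpy as np

all_ids = [1, 5, 42, 7, 99]                       # placeholder; use the real token IDs

arr = np.array(all_ids, dtype=np.uint16)          # 50,257 < 65,536, so uint16 is enough
arr.tofile("train.bin")                           # raw binary dump, no pickling overhead

# Later, the file can be opened lazily with a memory map:
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
print(len(data), data[:5])
```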
Step 3: Creation of Input-Output Pairs (Self-Supervised Learning)
Now we teach the model its main task: next token prediction.
Language models learn in a self-supervised way. We do not give it labels like in classification (e.g., "This is a cat"). Instead, the model creates its own labels.
How We Create Pairs (X and Y)
Imagine one sequence of tokens:
[Token 1, Token 2, Token 3, Token 4] (This is the Input, X).
To get the ground truth (the correct answer), we simply shift the input to the right by one position.
[Token 2, Token 3, Token 4, Token 5] (This is the Output or Ground Truth, Y).
The model is trained to predict the next token correctly at every position. For example:
- If the input is [Token 1], the desired output is [Token 2].
- If the input is [Token 1, Token 2], the desired output is [Token 3].
This defines the model's fundamental training task.
We need two key numbers for this step:
- Context Size (Block Size): How many tokens the model looks at for input at one time.
- Batch Size: How many input-output pairs (X, Y) are processed together before the model updates its knowledge.
We select random sequences from our train.bin file to create a batch.
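A sketch of this batching step is shown below; the block size and batch size values are placeholders, not the ones used in the actual run:

```python
# Sketch of batch creation: pick random starting positions in the memory-mapped
# token stream, take `block_size` tokens as X, and the same window shifted one
# position to the right as Y.
import numpy as np
import torch

block_size = 256      # context size (tokens per example); illustrative value
batch_size = 32       # examples per batch; illustrative value

def get_batch(path="train.bin"):
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # random starting offsets, leaving room for the +1 shifted target
    ix = np.random.randint(0, len(data) - block_size - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i : i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y   # x: (batch_size, block_size), y: same shape, shifted by one token
```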
Step 4: Assembling the Gemma Architecture
The Gemma architecture (Gemma 3 270M) is the "brain" of the model. It has three main parts: Input Block, Processor Block, and Output Block.
The processor block contains 18 Transformer Blocks stacked on top of each other.
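For reference, the hyperparameters quoted throughout this walkthrough can be collected in one small configuration object; only values stated in the text are filled in, and the class name is purely illustrative:

```python
# The hyperparameters named in this walkthrough, gathered in one place.
from dataclasses import dataclass

@dataclass
class GemmaConfig:
    vocab_size: int = 50_257    # tokenizer vocabulary size
    emb_dim: int = 640          # embedding dimension
    hidden_dim: int = 2048      # FFN expansion dimension
    n_layers: int = 18          # transformer blocks
    sliding_window: int = 512   # local attention window (tokens)
```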
4.1. The Input Block
- Token Embeddings: First, the token IDs are converted into embedding vectors (high-dimensional vectors). Words with similar meanings (like 'apple' and 'mango') will be closer together in this vector space. These vectors are trainable parameters, meaning they learn meaning during training. The embedding dimension used in Gemma 3 270M is 640.
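A minimal sketch of this lookup in PyTorch; the token IDs are made up for illustration:

```python
# Token-embedding lookup: each token ID maps to a trainable 640-dimensional vector.
import torch
import torch.nn as nn

tok_emb = nn.Embedding(num_embeddings=50_257, embedding_dim=640)

ids = torch.tensor([[12, 345, 678]])   # (batch=1, seq_len=3) token IDs
vectors = tok_emb(ids)                 # (1, 3, 640) trainable embedding vectors
print(vectors.shape)
```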
4.2. Key Components of the Processor Block
Each of the 18 Transformer blocks contains several modules.
4.2.1. RMS Normalization (RMS Norm)
Normalization helps stabilize the learning process. Gemma uses Root Mean Square (RMS) Normalization.
- The RMS value is calculated from the input vector: it is the square root of the mean of the squared elements.
- The vector is then divided by this RMS value.
- RMS Norm includes trainable scale and shift parameters. These are extra free parameters that help improve performance.
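A minimal RMS Norm sketch following the description above (the epsilon term is a standard numerical-stability addition, not something stated in the text):

```python
# RMS normalization: divide by the root mean square of the vector, then apply
# trainable scale and shift parameters, as described above.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # root mean square over the embedding dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.scale + self.shift
```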
4.2.2. The Attention Mechanism
The attention mechanism is where the tokens start understanding their neighbors and keeping context. Gemma uses several modern techniques in attention:
- Multi-Query Attention (MQA): Traditional attention uses separate learned matrices for Query (Q), Key (K), and Value (V) for every attention head. In MQA, we save computational cost by making the Key (K) and Value (V) matrices share the same content across all attention heads. This greatly reduces the number of trainable parameters needed and saves memory.
- Sliding Window Attention: This is an optimization of causal attention. Causal attention means a token can only look at tokens that came before it (not into the future). Sliding Window Attention adds a further rule: a token also cannot look too far back into the past, only within a defined window (e.g., 512 tokens); a mask sketch follows this list. This restriction makes the calculations much cheaper (up to 64 times cheaper). Gemma uses 15 sliding-window attention blocks and 3 full causal attention blocks.
- Rotary Positional Encodings (RoPE): Language models must know the position of tokens. Instead of adding positional information to the token embedding (which can ruin the meaning), RoPE rotates the Query (Q) and Key (K) vectors. The amount of rotation depends on the token's position. This rotation embeds position information while keeping the vector's magnitude the same.
- QK Norm: RMS normalization is applied specifically to the Query (Q) and Key (K) vectors before attention scores are calculated.
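Here is a small sketch of the sliding-window causal mask mentioned above; the window size and sequence length are illustrative:

```python
# Sliding-window causal mask: position i may attend to position j only if
# j <= i (causal) and i - j < window (not too far into the past).
# True entries mark positions that must be masked out.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j > i                           # no looking into the future
    too_far = (i - j) >= window              # no looking too far into the past
    return causal | too_far                  # (seq_len, seq_len) boolean mask

print(sliding_window_mask(6, window=3).int())
```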
4.2.3. Feed Forward Neural Network (FFN)
The FFN expands the vector dimension (e.g., from 640 to 2048) and then contracts it back. This expansion lets the model explore a richer, higher-dimensional space for finding patterns. Gemma uses a slightly more involved FFN design, with two parallel projections during the expansion step, for better expressiveness.
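A sketch of such a gated feed-forward block; the GELU activation is an assumption in the spirit of the Gemma family, while the 640 to 2048 to 640 shape matches the dimensions quoted above:

```python
# Gated ("two parallel networks") feed-forward block: one projection acts as a
# gate, the other as the value; they are multiplied elementwise and projected back.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    def __init__(self, emb_dim: int = 640, hidden_dim: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(emb_dim, hidden_dim, bias=False)  # parallel net 1
        self.up_proj = nn.Linear(emb_dim, hidden_dim, bias=False)    # parallel net 2
        self.down_proj = nn.Linear(hidden_dim, emb_dim, bias=False)  # contraction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```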
4.3. The Output Block
After the input passes through all 18 transformer blocks, it reaches the final layer.
- Final RMS Norm: Normalization is applied.
- Output Layer: This layer is critical: it converts the embedding dimension (640) back to the vocabulary dimension (50,257). This conversion results in a vector (logits) for each token, showing how likely every word in the vocabulary is to be the next word.
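A minimal sketch of the output block; the parameter-free `rms_norm` helper stands in for the final RMS Norm layer:

```python
# Output block: a final RMS normalization followed by a linear layer that maps
# each 640-dimensional token vector to 50,257 logits, one score per vocabulary entry.
import torch
import torch.nn as nn

emb_dim, vocab_size = 640, 50_257

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

lm_head = nn.Linear(emb_dim, vocab_size, bias=False)

hidden = torch.randn(1, 8, emb_dim)      # dummy output of the 18 transformer blocks
logits = lm_head(rms_norm(hidden))       # (1, 8, 50257): one score per vocab entry
print(logits.shape)
```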
Step 5: Pre-training: Learning the Language
The pre-training loop is where the model learns by updating its parameters (weights).
5.1. Calculating Loss
We compare the model's predictions (the output logits from the final layer) with the ground truth (the shifted input Y).
- We use the Cross-Entropy Loss function (Negative Log Likelihood).
- This loss function checks how high the probability is for the correct next token. If the probability of the correct token is high (close to 1), the loss is low. If the probability is low, the loss is high.
- The model calculates the loss for all input sequences in one batch and takes the mean.
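A sketch of this loss computation with dummy tensors; PyTorch's `F.cross_entropy` averages over all positions by default:

```python
# Flatten the (batch, seq_len, vocab) logits and the (batch, seq_len) shifted
# targets, then take the mean cross-entropy (negative log likelihood of the
# correct next token) over every position in the batch.
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 4, 8, 50_257
logits = torch.randn(batch, seq_len, vocab)          # model predictions (dummy)
targets = torch.randint(0, vocab, (batch, seq_len))  # ground-truth next tokens (dummy)

loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
print(loss.item())   # low when correct tokens get high probability, high otherwise
```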
5.2. Optimization and Efficiency
We use several techniques to make training faster and more stable (a condensed sketch combining them follows this list):
- AdamW Optimizer: This is the optimizer that updates the parameters, using the gradients computed by backpropagation of the loss.
- Learning Rate Schedule: We do not use a fixed learning rate. We start with a ramp-up (exploration) and then gradually decrease the rate (exploitation).
- Mixed Precision: We use `float16` numbers for many calculations (like matrix multiplications) because they are faster, but we keep stable `float32` for crucial steps like the loss calculation.
- Gradient Accumulation: If our desired batch size (e.g., 1024) is too large for the GPU memory, we split it into smaller "micro batches" (e.g., 32). We calculate and accumulate the gradient for each micro batch sequentially, and only update the parameters after all micro batches have been processed. This simulates a larger batch size without using too much GPU memory.
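The condensed sketch below combines these pieces: AdamW, a warmup-then-decay learning-rate schedule, `float16` autocast, and gradient accumulation. The tiny embedding-plus-linear "model" and the random batches are placeholders so the sketch runs on its own, and every hyperparameter value here is illustrative:

```python
# Condensed training-loop sketch: AdamW, warmup + cosine-decay learning rate,
# float16 autocast for the forward pass, float32 loss, gradient accumulation.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, emb_dim, block_size = 50_257, 640, 256
micro_batch, accum_steps = 32, 8                 # 32 x 8 simulates a larger effective batch
max_steps, warmup_steps, max_lr = 100, 10, 3e-4

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Embedding(vocab, emb_dim), nn.Linear(emb_dim, vocab)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def lr_at(step: int) -> float:
    if step < warmup_steps:                                   # ramp-up (exploration)
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))  # decay (exploitation)

def get_batch():  # placeholder: random tokens instead of reads from train.bin
    x = torch.randint(0, vocab, (micro_batch, block_size), device=device)
    return x, torch.roll(x, shifts=-1, dims=1)

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)                             # scheduled learning rate
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):                              # gradient accumulation
        x, y = get_batch()
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
            logits = model(x)                                 # fast float16 matmuls on GPU
        loss = F.cross_entropy(logits.float().view(-1, vocab), y.view(-1)) / accum_steps
        scaler.scale(loss).backward()                         # gradients summed across micro batches
    scaler.step(optimizer)                                    # single parameter update
    scaler.update()
```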
Step 6: Inference: Making Predictions
Once the training is done and the loss has decreased, the model is ready for inference (generating new text). The model parameters are now fixed.
How the Model Generates Text
- We give the model a short sentence (prompt).
- The prompt is passed through the 18 transformer blocks.
- We look at the output of the last token in the input sequence.
- The model chooses the next token that has the highest probability in the output layer.
- This new token is appended (added) to the original input sequence.
- The new, slightly longer sequence is passed back into the model, and the process repeats (a minimal sketch of this loop follows).
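A minimal greedy-generation sketch of this loop; `model` is assumed to map token IDs of shape (1, seq_len) to logits of shape (1, seq_len, vocab_size):

```python
# Greedy autoregressive generation: feed the current sequence in, take the
# logits of the last position, pick the most likely token, append it, repeat.
import torch

@torch.no_grad()
def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 50) -> torch.Tensor:
    ids = prompt_ids                                             # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                                      # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=1)                   # append and feed back in
    return ids
```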
Even though the model was only trained to predict the very next token, it produces coherent and meaningful text. This shows that by doing the next token prediction task repeatedly, the model learned the complex rules of the English language entirely from the data and the sophisticated architecture.