Building Truly Open GPT-OSS: A Technical Guide
Hello, future AI engineers!
The goal of this guide is clear: to help you understand how to build a Large Language Model (LLM) called GPT OSS completely from scratch. This is not about assembling pre-made components; it is about starting from raw data.
Why is this important for you?
Building an LLM this way helps you learn the exact "nuts and bolts" of how these complex systems work. It teaches you the foundation of AI and systems engineering, making the technology truly accessible.
We will focus on the Nano GPT OSS version. This smaller model keeps the full essence of the architecture but costs less than $10 to train on one A40 GPU, making deep learning practical for everyone.
The Six Steps to Building an LLM
The process is divided into six main technical steps:
- Choosing the data set.
- Tokenization (converting text to numbers).
- Creating input and output pairs.
- Assembling the GPT OSS architecture.
- Setting up the pre-training pipeline.
- Generating text (inference).
Step 1: Data Set Selection
When building from scratch, we must choose the training data set wisely. We need data that is large enough to teach the model how the English language works (its structure and meaning) but small enough to train within a budget.
We use the Tiny Stories data set.
- This set has 2 million short stories.
- The stories use words that a 3- to 4-year-old child can understand.
- The benefit is that even small models (like our Nano GPT OSS) can learn the correct grammatical structure (form) and the sensible meaning of English from this data.
The real test is whether the model can generate new, sensible stories during inference.
Step 2: Advanced Tokenization
Computers cannot understand words directly. We must convert every word into a number, called a token ID. This process is called tokenization.
Why Simple Tokenization Fails
- Word-Based Tokenization: If every unique word is a token, the vocabulary size is huge (over 1 million words for English). A larger vocabulary means more model parameters and higher training costs. Misspelled or new words also cause an Out-of-Vocabulary (OOV) problem.
- Character-Based Tokenization: Using just letters (26 characters) solves the large vocabulary problem. But it destroys the meaning carried by full words, making the model worse at understanding language. It also creates too many tokens, which can exceed the model's context window (memory).
The Solution: Subword Tokenization
GPT OSS uses a subword tokenization scheme based on the Byte Pair Encoding (BPE) algorithm. This method strikes the best balance between the two approaches above.
- Saves Memory: The vocabulary size is kept reasonable (GPT OSS uses 201,088 unique IDs).
- Solves OOV: The model can break any unknown word down into individual characters if necessary.
- Keeps Meaning: It merges frequently used groups of characters into subwords (like 'ing' or 'tokeniz'). This helps keep the core meaning of words.
GPT OSS uses the O200K Harmony tokenizer. This tokenizer is special because it includes extra tokens that help the model understand conversations, like special tags for roles ("system," "user") and messages ("start of message").
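To make this concrete, here is a minimal sketch using the tiktoken library. Recent tiktoken releases ship the o200k_harmony encoding used by GPT OSS; if your version lacks it, the closely related o200k_base encoding demonstrates the same BPE behavior.

```python
import tiktoken

try:
    enc = tiktoken.get_encoding("o200k_harmony")  # GPT OSS tokenizer, if available
except ValueError:
    enc = tiktoken.get_encoding("o200k_base")     # closely related fallback

ids = enc.encode("The dog chased another dog")
print(ids)              # a short list of integer token IDs
print(enc.decode(ids))  # round-trips back to the original text

# An unknown word is split into familiar subwords instead of failing (no OOV):
print([enc.decode([i]) for i in enc.encode("untokenizable")])
```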
Step 3: Creating Input and Output Pairs
After tokenization, we have huge lists of token IDs (numbers) saved on the disk. The model must learn from these lists.
This learning method is called self-supervised and auto-regressive: the model creates its own training targets from the input itself. The task is called Next Token Prediction.
We define two key settings for training:
- Context Length (Max Length): The maximum number of tokens the model can attend to at once (Nano GPT OSS uses 4,000 tokens).
- Batch Size: The number of input sequences the model processes in one training step.
The Next Token Prediction Trick
For any given input sequence, the target output sequence is simply the input sequence shifted to the right by one token.
For example:
- Input: The | dog | chased
- Target Output: dog | chased | another
This method teaches the model, step-by-step, to predict the token that should come next, which allows it to learn the form and meaning of language.
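Here is a minimal sketch of this input/target shift, assuming the tokenized corpus is one long tensor of token IDs (all names here are illustrative):

```python
import torch

def get_batch(token_ids: torch.Tensor, context_length: int, batch_size: int):
    # Pick random starting offsets, leaving room for the one-token shift.
    starts = torch.randint(0, len(token_ids) - context_length - 1, (batch_size,))
    x = torch.stack([token_ids[s : s + context_length] for s in starts])
    y = torch.stack([token_ids[s + 1 : s + context_length + 1] for s in starts])
    return x, y  # y is x shifted by one token

tokens = torch.randint(0, 201_088, (100_000,))  # stand-in for the real corpus
x, y = get_batch(tokens, context_length=4_000, batch_size=8)
print(x.shape, y.shape)  # torch.Size([8, 4000]) torch.Size([8, 4000])
```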
Step 4: The GPT OSS Architecture
The architecture is the "brain" of the LLM. It consists of many advanced blocks (called transformer blocks). The 20 billion parameter GPT OSS uses 24 transformer blocks.
1. Token Embeddings
Input token IDs are first converted into high-dimensional vectors. We hope that after training, words with similar meanings will have vectors that are numerically closer together. GPT OSS uses an embedding dimension of 2880.
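A minimal sketch of the embedding step in PyTorch; the token IDs are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 201_088, 2880
tok_emb = nn.Embedding(vocab_size, emb_dim)   # a learnable lookup table

ids = torch.tensor([[976, 6446, 3007]])       # (batch=1, seq_len=3), illustrative IDs
vectors = tok_emb(ids)
print(vectors.shape)                          # torch.Size([1, 3, 2880])
```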
2. RMS Normalization
Normalization ensures the model trains smoothly and efficiently. GPT OSS uses Root Mean Square (RMS) Normalization, which is slightly faster than the older LayerNorm because it skips mean subtraction, while performing just as well or better. It simply rescales the magnitude of each vector and is applied multiple times throughout the architecture.
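A minimal RMSNorm sketch, assuming the standard formulation (GPT OSS's exact variant may differ in small details such as the epsilon value):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-dimension gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale each vector by its root-mean-square; no mean subtraction.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight
```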
3. Grouped Query Attention (GQA)
The Attention Mechanism allows the model to find relationships between words in a sequence, creating a context vector.
- The Key-Value (KV) Cache Problem: To speed up text generation (inference), the model stores (caches) the Key (K) and Value (V) matrices. In older architectures (Multi-Head Attention), this cache uses a lot of memory.
- GQA Solution: Grouped Query Attention (GQA) is a middle-ground solution. It groups the attention heads, and heads within the same group share their K and V parameters.
- The Ratio: GPT OSS uses an 8:1 ratio. This greatly reduces the memory needed for the cache while still keeping high performance.
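The sketch below illustrates the grouping idea with small, illustrative sizes; GPT OSS's real head counts and dimensions differ, but the 8:1 ratio is the same.

```python
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 1          # an 8:1 ratio, as in GPT OSS
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # only these small tensors
v = torch.randn(batch, n_kv_heads, seq, head_dim)   # need to be cached

# Expand K and V so each group of 8 query heads reads the same shared KV head.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```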
4. Sliding Window Attention
Traditional attention calculations become extremely slow as the context length ($N$) gets very long (it scales quadratically, $N^2$).
- Sliding Window: This mechanism dramatically reduces computation by preventing each token from looking too far back into the past. Each token can only attend to a small, fixed window of tokens before it (GPT OSS uses a window of 128 tokens).
- Implementation: The 24 transformer blocks alternate between using full causal attention and sliding window attention.
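A minimal sketch of a sliding-window causal mask; how GPT OSS fuses this into its attention kernels is an implementation detail beyond this sketch.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i                           # never look into the future
    near = (i - j) < window                   # never look too far into the past
    return causal & near                      # True where attention is allowed

print(sliding_window_mask(seq_len=6, window=3).int())
```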
5. Rotary Positional Encodings (RoPE)
The model needs to know the position of words to understand things like "The dog chased another dog" (which 'dog' is which?).
- RoPE Solution: Instead of adding positional data (which pollutes the semantic vector), RoPE rotates the Query (Q) and Key (K) vectors based on their position. Since the magnitude of the vector is not changed, the core meaning is preserved while positional information is injected.
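A minimal RoPE sketch using the standard rotate-pairs formulation (GPT OSS's production implementation adds frequency-scaling refinements that are omitted here):

```python
import torch

def rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    # x: (batch, heads, seq_len, head_dim); head_dim must be even.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2) / dim)
    angles = torch.arange(seq_len).unsqueeze(1) * inv_freq  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # rotate each pair of dimensions
    out[..., 1::2] = x1 * sin + x2 * cos   # by a position-dependent angle
    return out

q = torch.randn(1, 8, 16, 64)
# Rotation preserves vector magnitude, so the core meaning is untouched:
print(torch.allclose(rope(q).norm(dim=-1), q.norm(dim=-1), atol=1e-5))  # True
```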
6. Attention Bias (Attention Sinks)
In very deep layers of the model, tokens sometimes pay too much unnecessary attention (noise) to the first few tokens in the sequence.
- The Bias Trick: GPT OSS addresses this with an attention bias, also called an attention sink. An extra column (a learnable bias term) is added to the attention scores matrix, and this column acts like a "sink".
- Benefit: The bias column absorbs this unnecessary attention probability (the noise). This ensures that the attention scores for the actual tokens remain accurate and undiluted.
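Here is a minimal sketch of that bias column, assuming one learnable sink logit per attention head:

```python
import torch
import torch.nn as nn

n_heads, seq = 8, 16
sink = nn.Parameter(torch.zeros(n_heads))       # one learnable sink logit per head

scores = torch.randn(1, n_heads, seq, seq)      # raw attention scores
sink_col = sink.view(1, n_heads, 1, 1).expand(1, n_heads, seq, 1)

# Softmax over the real tokens plus the sink column, then discard the sink.
probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
probs = probs[..., :-1]
# Each row now sums to <= 1; the "missing" mass is the noise the sink absorbed.
```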
7. Mixture of Experts (MoE)
In a traditional Transformer block, the attention mechanism is followed by one Feed Forward Network (FFN). GPT OSS replaces this with a Mixture of Experts (MoE).
- Experts: The 20 billion parameter model has 32 specialized FFNs, or experts.
- Sparsity: The key feature is sparsity. Only a small number (four experts) are activated for any token.
- Benefit: Experts can specialize (e.g., one expert handles verbs, another handles nouns). Crucially, inference is much faster because most parameters remain inactive.
- Activation: The experts use the SwiGLU (Swish-gated linear unit) activation function.
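A minimal top-k routing sketch with illustrative sizes; the expert FFNs here are plain SiLU networks for brevity rather than GPT OSS's gated SwiGLU experts.

```python
import torch
import torch.nn as nn

n_experts, top_k, dim = 32, 4, 64   # illustrative width; GPT OSS uses 2880
router = nn.Linear(dim, n_experts)
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
    for _ in range(n_experts)
)

def moe(x: torch.Tensor) -> torch.Tensor:          # x: (num_tokens, dim)
    weights, idx = router(x).topk(top_k, dim=-1)   # score all 32, keep the top 4
    weights = torch.softmax(weights, dim=-1)       # normalize the 4 routing weights
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                    # token loop for clarity, not speed
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])    # only 4 of 32 experts ever run
    return out

print(moe(torch.randn(3, dim)).shape)  # torch.Size([3, 64])
```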
8. The Output Layer
After passing through all transformer blocks, the vector (still 2880 dimensions) goes through a final RMS normalization and the Linear Output Layer (Unembedding).
This final layer projects the vector from 2880 dimensions to the size of the vocabulary (201,088). The output is a matrix of scores, where the highest score indicates the model’s prediction for the next token.
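A minimal sketch of the output head, using PyTorch's built-in nn.RMSNorm (available in recent versions):

```python
import torch
import torch.nn as nn

emb_dim, vocab_size = 2880, 201_088
final_norm = nn.RMSNorm(emb_dim)              # built into recent PyTorch versions
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)

h = torch.randn(1, 5, emb_dim)                # hidden states after all 24 blocks
logits = lm_head(final_norm(h))               # scores over the whole vocabulary
next_id = logits[:, -1, :].argmax(dim=-1)     # highest score = predicted next token
print(logits.shape)                           # torch.Size([1, 5, 201088])
```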
Steps 5 & 6: Training and Inference
Step 5: The Pre-training Pipeline
Training is the process of using the computed loss (error) to update the model's parameters.
- Optimizer and Learning Rate: We use the AdamW optimizer with a dynamic learning-rate schedule called cosine annealing. The learning rate starts high for exploration and smoothly decreases as the model improves.
- Gradient Accumulation: This is an efficiency trick. If our desired batch size is too big for the GPU memory, we split the batch into smaller microbatches. We compute the gradient for each microbatch and add them up, updating the parameters only after the total batch size is reached. This simulates a large batch without running out of GPU memory.
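A minimal training-loop sketch combining AdamW, cosine annealing, and gradient accumulation; `model`, `get_batch`, and `tokens` are assumed from the earlier sketches, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

num_steps, accum_steps = 1_000, 8   # effective batch = micro-batch size x 8
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

for step in range(num_steps):
    optimizer.zero_grad()
    for _ in range(accum_steps):
        x, y = get_batch(tokens, context_length=4_000, batch_size=1)  # micro-batch
        logits = model(x)                                  # (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        (loss / accum_steps).backward()    # gradients add up across micro-batches
    optimizer.step()                       # one update per full simulated batch
    scheduler.step()                       # cosine-anneal the learning rate
```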
Step 6: Inference (Generating Text)
Once training is complete and the parameters are fixed, the model can generate text.
- We give the model a starting input sequence (a prompt).
- The trained model predicts the next token (T1).
- T1 is added to the input sequence.
- The new, longer sequence is fed back into the model to predict the next token (T2).
This repeating cycle continues until the model has generated a complete, coherent story. For instance, given "A little girl went to the woods," the Nano GPT OSS can generate sensible text that follows the English language rules it has learned.
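A minimal greedy generation loop, assuming the trained `model` and the `enc` tokenizer from the earlier sketches; real inference usually samples from the distribution rather than always taking the argmax.

```python
import torch

@torch.no_grad()
def generate(model, ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    for _ in range(max_new_tokens):
        logits = model(ids)                        # (1, T, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy pick of the next token
        ids = torch.cat([ids, next_id.unsqueeze(1)], dim=1)  # feed it back in
    return ids

prompt = torch.tensor([enc.encode("A little girl went to the woods")])
print(enc.decode(generate(model, prompt, max_new_tokens=50)[0].tolist()))
```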
By successfully building and training Nano GPT OSS, we demonstrate that these modern architectures are far more efficient than older models like GPT-2, reaching a lower loss much faster. This makes understanding and building AI accessible to you right now.