Demonstration code to create a GPT-2-style, generative small LAnguage Model that can be built on a personal computer.
This is not for production. You can use this code to learn about TensorFlow, generative language models, preprocessing, training, and model hyperparameters.
git clone [email protected]:bioteam/sLAM.git
cd sLAM
pip3 install .
Complete the installation:
> python3
>>> import nltk
>>> nltk.download('punkt_tab')
nltk is used for sentence tokenization.
> python3 sLAM/make-slam.py -h
usage: make-slam.py [-h] [-t TEXT_PERCENTAGE] [--context_size CONTEXT_SIZE] [-n NAME] [--temperature TEMPERATURE] [--epochs EPOCHS] [--d_model D_MODEL] [-d {wikitext-2-v1,cc_news}] [--num_datasets NUM_DATASETS]
[--min_chunk_len MIN_CHUNK_LEN] -p PROMPT [-v]
options:
-h, --help show this help message and exit
-t TEXT_PERCENTAGE, --text_percentage TEXT_PERCENTAGE
Percentage of wikitext-2-v1 used to make dataset
--context_size CONTEXT_SIZE
Context size
-n NAME, --name NAME Name used to save files, default is timestamp of completion
--temperature TEMPERATURE
Temperature used for generation
--epochs EPOCHS Number of epochs
--d_model D_MODEL Number of embedding dimensions
-d, --download {wikitext-2-v1,cc_news}
Dataset to download
--num_datasets NUM_DATASETS
Number of datasets to download from cc_news
--min_chunk_len MIN_CHUNK_LEN
Minimum length of cc_news chunk to use for training
-p PROMPT, --prompt PROMPT
Prompt
-v, --verbose         Verbose
The code uses cc_news (the default) or wikitext-2-v1 from Hugging Face as training text. cc_news is the cleaner of the 2, with less formatting text.
The default hyperparameters:
- epochs (3): Number of complete passes through the entire training dataset
- vocab_size (50,000): Number of unique tokens the model can understand/generate
- context_size (32): Maximum sequence length the model can process at once (the "memory window")
- d_model (256): Dimensionality of embeddings and internal representations
- n_heads (4): Number of parallel attention heads
- n_layers (4): Number of transformer blocks stacked together
- d_ff (1024): Hidden layer size in the transformer blocks
- temperature (0.7): Text generation randomness
- download (cc_news): Text input source
- num_datasets (5000): Number of datasets to download from cc_news
Download and clean training data from cc_news, tokenize it into large chunks, create a model, train the model using context-window-sized slices for 3 epochs, be verbose, and try the given prompt:
python3 sLAM/make-slam.py --num_datasets 1000 -v --epochs 3 -p "This is a test"
The command creates a Keras model trained on ~1M input tokens, a saved (serialized) tokenizer with the same name, and histograms of chunk lengths and token input numbers. For example:
-rw-r--r-- 332M Apr 1 05:09 04-01-2025-05-09-04.keras
-rw-r--r-- 58K Apr 1 05:09 04-01-2025-05-09-04.pkl
-rw-r--r-- 19K Mar 31 16:04 chunk_length_distribution.png
-rw-r--r-- 19K Mar 31 16:05 token_number_distribution.png
One epoch takes about 1 hour on a Mac M1 laptop (32 GB RAM) with the command above. However, more text and more epochs are needed to generate syntactically and semantically correct English.
Supply the name of the model and the serialized tokenizer, and a prompt:
python3 sLAM/generate.py -n 04-01-2025-05-09-04 -p "This is a test"
This is a test of now playing this case means he may be caught trying to arrest him by name and his wife and their father i really wanted to play for him and to get even better.
Very little text input is needed to get decent syntax but the semantics are off. As you increase the number of epochs and the number of input tokens the output approaches semantically correct English, for example, after 100 epochs with 10K cc_news datasets:
This is a test of paris ap — french president emmanuel macron is condemning the rally in charlottesville today to neonazis skinheads and ku klux klan members and the white nationalists were met with hundreds of counterprotesters.
This is a test of now playing gives you a party to your family and friends who shared with facebook on facebook now playing what would you do with the explosion now playing what might mean for us now playing family.
Here's a detailed explanation of the key components and how they work together. The model is a type of neural network architecture that can understand and generate sequential text. Unlike encoder-decoder models used for translation, this architecture focuses purely on text generation by predicting the next token in a sequence.
An embedding is a numerical vector representation or tensor of a discrete item like a token string or token position. The embeddings are initialized with random values within a small range, then trained. During training the model adjusts these embeddings by small increments, then tests these new embeddings to see if they can predict the next token better. In this code specifically, each embedding is a 1D tensor (vector) but the sLAM code also uses 3D tensors. When data is batched for training, each batch contains multiple sequences (in language modeling a sequence refers to an ordered series of tokens):
Shape: (batch_size, context_size, d_model) = (batch_size, 32, 256) by default
- Dimension 1: batch size (e.g., 4 samples processed together)
- Dimension 2: sequence length (32 tokens)
- Dimension 3: embedding dimensions (256-dimensional vectors)
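A minimal sketch of those shapes, assuming TensorFlow is installed and using the default sizes above (the variable names are illustrative, not from the sLAM code):

import tensorflow as tf

# Hypothetical sizes matching the defaults above
batch_size, context_size, d_model, vocab_size = 4, 32, 256, 50_000

# A batch of token IDs: shape (batch_size, context_size) = (4, 32)
token_ids = tf.random.uniform((batch_size, context_size), maxval=vocab_size, dtype=tf.int32)

# An embedding layer maps each ID to a d_model-dimensional vector
embed = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=d_model)
vectors = embed(token_ids)

print(vectors.shape)  # (4, 32, 256) = (batch_size, context_size, d_model)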
There are 2 kinds of embeddings:
Token Embeddings: Convert each word token into a dense vector representation (e.g., 256 dimensions). These embeddings are adjusted ("learn") during training to capture semantic meaning: similar words end up with similar vector representations.
Positional Embeddings: Transformers process all tokens simultaneously and need explicit position information. Positional embeddings encode where each token appears in the sequence, enabling the model to predict correct word order and syntax.
Token and positional embeddings are created using Keras Embedding layers, which are tables that map token indices to vectors.
In the TokenAndPositionEmbedding class in the code:
Token Embeddings:
This creates a lookup table with 50,000 rows (1 for each token in the vocabulary), where each row is a 256-dimensional vector. When you look up a token ID, you get its corresponding embedding vector.
self.token_emb = layers.Embedding(
input_dim=vocab_size, # 50,000 tokens
output_dim=d_model, # 256 dimensions
name="token_embeddings"
)
Positional Embeddings:
This creates another lookup table with 32 rows (1 for each position in the sequence), where each row is also a 256-dimensional vector. When you look up a position (0-31), you get its corresponding embedding.
self.pos_emb = layers.Embedding(
input_dim=context_size, # 32 tokens or positions
output_dim=d_model, # 256 dimensions
name="position_embeddings"
)
Combined in the call method:
token_embeddings = self.token_emb(inputs) # Get token vectors
position_embeddings = self.pos_emb(positions) # Get position vectors
return token_embeddings + position_embeddings # Add them together
Both embedding layers are learnable weights that are trained during model training through backpropagation. The 256 floats for each token start as random values and backpropagation adjusts each float based on the loss gradient. Over time, tokens that appear in similar contexts end up with similar embedding vectors. For example:
- Token "cat" might initially be: [0.02, -0.15, 0.08, ..., 0.12]
- Token "dog" might initially be: [-0.11, 0.09, -0.03, ..., 0.18]
After training, if they appear in similar contexts, their 256 floats are adjusted by the optimizer to become more similar.
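As an illustration (not taken from the sLAM code), cosine similarity is one common way to measure how close two embedding vectors are; the vectors here are hypothetical 4-dimensional stand-ins for the real 256-dimensional embeddings:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, ~0.0 = unrelated, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.02, -0.15, 0.08, 0.12])
dog_initial = np.array([-0.11, 0.09, -0.03, 0.18])   # random initialization: not similar to "cat"
dog_trained = np.array([0.03, -0.12, 0.10, 0.15])    # after training on similar contexts: much closer

print(cosine_similarity(cat, dog_initial))  # low similarity (~0.07)
print(cosine_similarity(cat, dog_trained))  # high similarity (~0.98)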
A core innovation of transformers (Vaswani et al., 2017) is self-attention, which trains each token to "attend to" or relate to other relevant tokens in the sequence.
Attention is a computational mechanism that transforms token embeddings into Query (Q), Key (K), and Value (V) vectors, then computes similarity scores between Q and K to generate weights via softmax, and finally produces a weighted sum of the V vectors. Multi-head attention performs this computation multiple times in parallel with different learned transformations. Position influences attention in 2 ways: positional embeddings are incorporated into the token representations that become Q, K, and V, and causal masking constrains which positions each token can attend to—preventing attention to future tokens and allowing only self and past token attention.
Attention(Q,K,V) = softmax(QK^T/√d_k)V
The formula computes a weighted sum of the Values (V). The softmax of (QK^T/√d_k) produces attention weights, a probability distribution showing which keys are most relevant to each query. These weights are then applied to V to produce an output tensor where each position has been enriched with information from all other relevant positions. In the sLAM code, this attention output is fed into a residual connection (added back to the input), then passed through layer normalization, and then into the feed-forward network. So the attention result is the "refined" token representation that incorporates context from other tokens in the sequence.
During the Q-K comparison, some token pairs produce high dot product scores and others produce low scores. Tokens with high scores are given more weight in the final output. What counts as "relevant" (what produces high scores) is learned by the model during training: the Q, K, and V weight matrices are adjusted so that the model learns to assign high scores to tokens that help predict the next token accurately.
Q, K, and V are computed from the input embeddings using learnable weight matrices. So the actual weights that get adjusted during backpropagation are those transformation matrices (often called projection matrices). For each input embedding, the model multiplies it by these weight matrices to produce Q, K, and V vectors. During training, backpropagation adjusts these weight matrices so that the resulting Q, K, V vectors encode relationships that help the model predict the next token accurately.
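A minimal NumPy sketch of this computation, with random matrices standing in for the learned projection matrices (an illustration of the mechanism, not the sLAM implementation):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k, seq_len = 8, 8, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))   # token + position embeddings for 5 tokens

# Learnable projection matrices (random here; adjusted by backpropagation in practice)
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project each embedding into Q, K, V

scores = Q @ K.T / np.sqrt(d_k)           # similarity of every query to every key
weights = softmax(scores, axis=-1)        # each row is a probability distribution
output = weights @ V                      # weighted sum of Values per position

print(weights.shape, output.shape)        # (5, 5) (5, 8)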
In a decoder-only model like this, the model attends to its own previous tokens to predict the next one. This is called causal self-attention — each token can only see tokens before it, not future ones. If the input is the sequence of tokens A | cat | sat | on | the, then we want to predict the word after "the". Each token attends to all preceding tokens (including itself), something like this:
1. Each token gets Q, K, V vectors
"A" → Q₁, K₁, V₁
"cat" → Q₂, K₂, V₂
"sat" → Q₃, K₃, V₃
"on" → Q₄, K₄, V₄
"the" → Q₅, K₅, V₅ ← this is the query position
2. Score Q₅ against all previous keys
score("the" vs "A") = Q₅ · K₁ = 0.4
score("the" vs "cat") = Q₅ · K₂ = 2.1
score("the" vs "sat") = Q₅ · K₃ = 0.8
score("the" vs "on") = Q₅ · K₄ = 1.1
score("the" vs "the") = Q₅ · K₅ = 3.9 ← attends to itself + context
Future tokens are masked to -∞ before softmax, so they become 0.
3. Softmax → attention weights
weights ≈ [0.02, 0.12, 0.05, 0.08, 0.73]
The model leans heavily on the current "the" (it's a determiner, so a noun likely follows) but also picks up signal from "cat" (a noun came after the first "the").
4. Weighted sum → context vector → predict next token
context = 0.02·V₁ + 0.12·V₂ + 0.05·V₃ + 0.08·V₄ + 0.73·V₅
This context vector feeds into a linear layer + softmax over the vocabulary. The model outputs a probability distribution — ideally high probability on tokens like "mat", "floor", "rug", etc.
What training does: The correct next token ("mat") is known. The cross-entropy loss is computed, and backpropagation adjusts Wq, Wk, Wv so that over many examples:
- "A" preceding a noun learns to attend to prior nouns for context
- "sat on the __" learns to up-weight location/surface nouns
Input tokens are converted to embeddings, combined with positional embeddings, and passed through a dropout layer. The data then flows through 4 stacked transformer blocks (detailed below), and a final dense layer projects the output to vocabulary size to produce logits for next-token prediction. Data flows through as 3D tensors of shape (batch_size, sequence_length, embedding_dimension), and all weights are learned through backpropagation during training.
The complete flow is:
Input → Token+Position Embeddings → Dropout → [Transformer Block 1] → [Transformer Block 2] → [Transformer Block 3] → [Transformer Block 4] → Output Dense Layer → Logits
The embedding and dropout layers prepare the data, the blocks do the main processing, and the final dense layer produces the prediction scores for each token in the vocabulary.
There are layers both before and after the transformer blocks:
Before the transformer blocks:
- Input layer - accepts the token IDs
- TokenAndPositionEmbedding layer - combines token embeddings with positional embeddings
- Dropout layer - randomly drops 10% of values for regularization
After the transformer blocks:
- Dense layer - projects the output to the vocabulary size to produce logits for predicting the next token
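A condensed Keras sketch of this layout. It reuses the TokenAndPositionEmbedding idea from above, and the transformer_block function is a simplified stand-in for the blocks described below (the real sLAM code differs in detail):

import tensorflow as tf
from tensorflow.keras import layers

vocab_size, context_size, d_model, n_heads, n_layers, d_ff = 50_000, 32, 256, 4, 4, 1024

class TokenAndPositionEmbedding(layers.Layer):
    # Simplified version of the embedding layer described earlier
    def __init__(self, vocab_size, context_size, d_model):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=d_model)
        self.pos_emb = layers.Embedding(input_dim=context_size, output_dim=d_model)

    def call(self, inputs):
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        return self.token_emb(inputs) + self.pos_emb(positions)

def transformer_block(x):
    # Simplified block: causal attention + residual + norm, then FFN + residual + norm
    attn = layers.MultiHeadAttention(num_heads=n_heads, key_dim=d_model // n_heads)(
        x, x, use_causal_mask=True)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, attn]))
    ff = layers.Dense(d_ff, activation="gelu")(x)
    ff = layers.Dense(d_model)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, ff]))

inputs = layers.Input(shape=(context_size,), dtype="int32")          # token IDs
x = TokenAndPositionEmbedding(vocab_size, context_size, d_model)(inputs)
x = layers.Dropout(0.1)(x)
for _ in range(n_layers):
    x = transformer_block(x)
logits = layers.Dense(vocab_size)(x)                                 # scores over the vocabulary
model = tf.keras.Model(inputs, logits)
model.summary()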
Dropout (Srivastava et al., 2014) is a regularization technique that prevents overfitting. During training, it randomly sets a percentage of values (10% by default in sLAM) to zero. This forces the model to learn more robust features by not relying on any single activation value. It's like training with incomplete information—the model learns to work with different random subsets of neurons, which makes it generalize better to new data. During generation/inference, dropout is not applied.
For example, with 10% dropout applied to an 8-dimensional vector during training:
Before dropout: [0.5, 1.2, -0.3, 0.8, 0.1, -0.6, 0.9, 0.4]
After dropout: [0.5, 0.0, -0.3, 0.8, 0.1, -0.6, 0.9, 0.4] ← 1 of 8 values (~10%) randomly zeroed
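A small demonstration of this behavior with a Keras Dropout layer (note that Keras also rescales the surviving values by 1/(1 - rate) during training so the expected sum is unchanged):

import tensorflow as tf

x = tf.constant([[0.5, 1.2, -0.3, 0.8, 0.1, -0.6, 0.9, 0.4]])
dropout = tf.keras.layers.Dropout(rate=0.1)

print(dropout(x, training=True))    # some values zeroed, the rest scaled by 1/0.9
print(dropout(x, training=False))   # inference: dropout is a no-op, values unchanged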
Dropout is used in 3 places: before the blocks, within each block's attention mechanism, and after each block's feed-forward network.
The model has 4 blocks, by default, specified by the n_layers parameter. Each transformer block contains the following components, in order:
- Multi-head attention layer with causal masking
- Residual connection (He et al., 2016) adding attention output back to input
- Layer normalization
- Feed-forward network with 2 dense layers:
  - First dense layer with GELU (Hendrycks & Gimpel, 2016) activation: layers.Dense(self.d_ff, activation="gelu")(x)
  - Second dense layer with no activation (linear): layers.Dense(self.d_model)(ff_output)
- Dropout layer for regularization (10% by default, applied after feed-forward)
- Residual connection adding feed-forward output back to input
- Layer normalization
Layer normalization (Ba et al., 2016) normalizes the activations (output values) of a layer to have a mean of 0 and standard deviation of 1. This keeps the values from becoming too large or too small, which:
- Prevents training from becoming unstable
- Allows for higher learning rates
For example, a 4-dimensional activation vector before and after layer normalization:
Before: [1.0, 3.0, 5.0, 7.0] (mean=4.0, std=2.24)
After: [-1.34, -0.45, 0.45, 1.34] (mean≈0.0, std≈1.0)
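The same normalization reproduced in NumPy (ignoring the small epsilon and the learnable scale and offset that Keras LayerNormalization also applies):

import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])
normalized = (x - x.mean()) / x.std()   # subtract the mean, divide by the standard deviation
print(np.round(normalized, 2))          # [-1.34 -0.45  0.45  1.34]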
Layer normalization occurs in 2 places within each transformer block:
After the attention + residual connection:
x = layers.Add()([x, attn_output])
x = layers.LayerNormalization(epsilon=1e-6)(x) # <-- HERE
This normalizes the output before sending it to the feed-forward network.
After the feed-forward + residual connection:
x = layers.Add()([x, ff_output])
x = layers.LayerNormalization(epsilon=1e-6)(x) # <-- AND HERE
This normalizes the output at the end of the block before it goes to the next block (or output layer).
It's different from batch normalization (which normalizes across the batch dimension)—layer normalization normalizes across the feature dimension for each individual sample, which is more suitable for transformers and sequence models.
A detail often unappreciated by novices is that all input to a neural network, whether for training or inference, is some form of matrix filled with numbers, integer or float.
- Text Cleaning: Filters high-quality text from datasets (cc_news or wikitext)
- Tokenization: Converts text to integer token IDs using Keras TextVectorization
- Sequence Creation: Sliding window approach creates input/target pairs for next-token prediction (see the sketch after this list), for example:
  - Input: [token1, token2, token3, token4]
  - Target: [token2, token3, token4, token5]
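A minimal sketch of the sliding-window idea with a toy token list and a hypothetical window size (the real code uses context_size-length slices):

def make_pairs(token_ids, window):
    # Each input window is paired with the same window shifted one token to the right
    inputs, targets = [], []
    for i in range(len(token_ids) - window):
        inputs.append(token_ids[i:i + window])
        targets.append(token_ids[i + 1:i + window + 1])
    return inputs, targets

tokens = [11, 42, 7, 93, 5, 68]              # toy token IDs
inputs, targets = make_pairs(tokens, window=4)
print(inputs[0], targets[0])                 # [11, 42, 7, 93] [42, 7, 93, 5]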
Within a single epoch:
1. Batching: Training data is divided into batches (default batch_size=4 in slam.py)
2. Forward Pass: For each batch, the model makes predictions
3. Loss Calculation: Compare predictions to target tokens using SparseCategoricalCrossentropy
4. Backward Pass (Backpropagation): Compute gradients
5. Parameter Update: Adam optimizer updates all weights
6. Repeat: Steps 2-5 repeat for each batch until all training data is processed
7. Validation: After all batches, the model is evaluated on validation data
8. Epoch Complete: 1 full pass is done
For example in slam.py if you have 10,000 training samples and batch_size=4:
- Steps per epoch = 10,000 ÷ 4 = 2,500 steps
- With epochs=3, training runs 3 complete passes = 7,500 steps
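Schematically, this loop is what model.compile plus model.fit carries out. A self-contained toy sketch (the tiny model and random data here are stand-ins, not the sLAM model or dataset):

import numpy as np
import tensorflow as tf

vocab_size, context_size = 100, 8

# A tiny stand-in model: embedding plus a dense layer producing next-token logits
model = tf.keras.Sequential([
    tf.keras.Input(shape=(context_size,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, 16),
    tf.keras.layers.Dense(vocab_size),
])

# Toy data: 1,000 random input/target sequences (real pairs come from the sliding window step)
x = np.random.randint(0, vocab_size, size=(1000, context_size)).astype("int32")
y = np.random.randint(0, vocab_size, size=(1000, context_size)).astype("int32")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-8),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# batch_size=4 and epochs=3 match the defaults; with validation_split=0.1,
# 900 training samples / 4 per batch = 225 steps per epoch
model.fit(x, y, batch_size=4, epochs=3, validation_split=0.1)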
During the forward pass, embeddings are looked up and attention transforms them into contextual representations — the embeddings themselves are unchanged, attention just reads from them. During the backward pass, gradients flow through the attention mechanism back to the embeddings and the optimizer updates all weights — embeddings, Q/K/V matrices, FFN weights. Attention doesn't modify the embedding table directly; it carries the gradient signal through which the embeddings learn what they should represent.
The loss function measures how wrong the model's predictions are and serves as the signal that guides learning. SparseCategoricalCrossentropy compares the model's predicted logits to the actual next token ID. In essence, the loss function is the feedback mechanism that tells the model how to learn.
For example, if the correct next token is "mat" (token ID 2) and the model produces logits for 5 tokens:
Logits: [1.2, 0.5, 3.8, 0.1, -0.3]
After softmax: [0.06, 0.03, 0.87, 0.02, 0.01] (probabilities sum to ~1.0)
                            ↑
                   token ID 2 = "mat"
Loss = -log(0.87) = 0.14 (low loss — good prediction)
If the model had assigned only 0.05 probability to "mat", the loss would be -log(0.05) = 3.0 — a much higher loss, producing larger gradients and bigger weight updates.
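Reproducing this arithmetic with the actual Keras loss (from_logits=True tells the loss to apply softmax internally):

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

logits = [[1.2, 0.5, 3.8, 0.1, -0.3]]   # model scores for 5 candidate tokens
target = [2]                            # the correct next token is ID 2 ("mat")

print(float(loss_fn(target, logits)))   # ~0.14, i.e. -log(softmax(logits)[2])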
Adam (Kingma & Ba, 2015) is an optimization algorithm used to update the model's weights during training based on the gradients computed from the loss function.
self.optimizer = tf.keras.optimizers.Adam(
learning_rate=lr_schedule, epsilon=1e-8
)
The Adam optimizer stores and uses data from previous steps for optimization using exponentially decaying moving averages, not by storing all raw results from every past epoch or step. This memory is maintained across epochs as part of the optimizer's internal state. Here is how Adam uses past data:
- Momentum (First Moment Estimate): Adam maintains a running exponential moving average of the past gradients, often referred to as the "first moment" vector (denoted as m). This helps to smooth out the optimization path and maintain velocity in consistent directions, leading to faster convergence and reduced oscillation.
- Adaptive Learning Rates (Second Moment Estimate): Adam also tracks an exponential moving average of the squared gradients, known as the "second moment" vector (denoted as v). This information is used to adapt the learning rate for each individual parameter of the model, allowing for larger steps for infrequent parameters and smaller steps for frequent ones.
- Bias Correction: The moving averages m and v are initialized with zeros and are therefore biased towards zero, especially during the initial iterations of training. Adam applies a bias correction mechanism to these estimates to ensure they are more accurate, particularly in the early stages of training.
m and v are updated at every training step and persist throughout the entire training process, allowing the optimizer to intelligently navigate the complex loss landscape. The memory requirement for this is relatively low as it only involves storing 2 moving average vectors that are the same size as the model's parameters.
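The update rule written out as a plain-Python sketch for a single parameter vector (the standard Adam equations from Kingma & Ba, 2015, not code from sLAM):

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                   # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2                # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1**t)                           # bias correction (t is the step number, starting at 1)
    v_hat = v / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return param, m, v

param = np.array([0.5, -0.2])
m = v = np.zeros_like(param)
for t in range(1, 4):                                    # a few toy steps
    grad = 2 * param                                     # gradient of a toy loss: sum(param**2)
    param, m, v = adam_step(param, grad, m, v, t)
print(param)                                             # values move toward 0, the toy loss minimum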
The gradients inform the training code on how to adjust the internal learnable weights, or parameters. Gradients in a neural network are vectors of partial derivatives that measure how much the network's loss (error) changes with respect to its weights and biases. They represent the slope of the cost function, pointing in the direction of the steepest ascent. By computing gradients via backpropagation, optimizers update parameters in the opposite direction (gradient descent) to minimize error and improve model performance.
The gradients indicate the direction and rate at which parameters should be adjusted to reduce error. They point "uphill" towards higher loss, which is why algorithms move in the negative gradient direction to go "downhill" (minimize loss). A large gradient indicates a steep slope (requiring significant updates), while a small gradient indicates a flat region. The backpropagation algorithm calculates the gradients for every parameter by traversing the network backward from the output layer to the input layer.
Adam is the default choice for training modern neural networks including language models like sLAM because it combines the benefits of momentum-based methods with adaptive learning rates.
The FFN in sLAM is 2 dense, fully connected layers:
layers.Dense(self.d_ff, activation="gelu") # expand: 256 → 1024
layers.Dense(self.d_model) # contract: 1024 → 256
This expands the representation to a larger dimension (1024), applies a non-linear activation function (GELU), then contracts back down to the model dimension (256). An activation function is a mathematical function applied to a layer's output that introduces non-linearity — without it, stacking multiple layers would be equivalent to a single linear transformation, and the network couldn't learn complex patterns. This expansion-contraction gives the model extra capacity to transform the representation in ways attention alone can't.
For example, with a simplified 3 → 6 → 3 expansion-contraction:
Input vector (3-dim): [0.5, -0.2, 0.8]
After expand + GELU (6-dim): [0.0, 0.7, -0.0, 1.2, 0.3, -0.0] ← richer representation
After contract (3-dim): [0.9, 0.1, -0.4] ← transformed back to original size
The input and output are the same dimensionality but the values have been transformed — the expansion to a higher dimension gives the network room to compute features it couldn't represent in the smaller space.
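A runnable version of the toy 3 → 6 → 3 expansion-contraction (the weights are randomly initialized here, so the printed values will differ from the illustration above):

import tensorflow as tf

x = tf.constant([[0.5, -0.2, 0.8]])                   # one 3-dimensional token vector

expand = tf.keras.layers.Dense(6, activation="gelu")  # 3 -> 6 with a non-linearity
contract = tf.keras.layers.Dense(3)                   # 6 -> 3, back to the model dimension

hidden = expand(x)
output = contract(hidden)
print(hidden.shape, output.shape)                     # (1, 6) (1, 3): same size in and out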
A rough intuition for the division of labor:
- Attention — mixes information across tokens (which tokens relate to which)
- FFN — transforms the representation of each token independently (no cross-token interaction)
The FFN weights are also learned during training via backpropagation, just like the embedding and attention weight matrices. Both FFN weights and embeddings are weight matrices updated by backpropagation — but they work differently.
Each embedding corresponds to a specific token, but the FFN weights are applied via matrix multiplication to every token's vector; they are not tied to specific tokens — the same weights are applied to every position. A concrete way to see the difference:
Embedding: token_id=42 → look up row 42 → [0.3, -0.1, 0.8, ...]
FFN: vector → multiply by W → new transformed vector
The deeper similarity is that both are just matrices of floats that get adjusted by Adam during training. In that sense all learned parameters in a neural network are matrices of numbers updated by gradient descent. The difference is in how they're used during the forward pass.
Checkpoints serve as intermediate saves of the model's learned weights during training, enabling recovery, model selection, and efficient disk management. A Callback is a mechanism that hooks into the training process at specific events (epoch end, batch end, etc.). In slam.py, callbacks are custom classes like ValidationPrintCallback and SmartCheckpointCallback that inherit from tf.keras.callbacks.Callback. They execute custom logic during different stages of training.
For example, the SmartCheckpointCallback manages the creation and lifecycle of checkpoint files, saving them at intervals and keeping only the N best checkpoints based on validation loss.
The custom callbacks for monitoring training stability:
- ValidationPrintCallback: Tracks performance on held-out data
- SmartCheckpointCallback: Saves model state during training
- EarlyStopping: When training stops due to no improvement, the model weights are restored to the best checkpoint
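As an illustration of the callback mechanism, here is a minimal hypothetical callback that only prints the validation loss each epoch (the real ValidationPrintCallback and SmartCheckpointCallback in slam.py do more than this):

import tensorflow as tf

class PrintValLossCallback(tf.keras.callbacks.Callback):
    # Keras calls on_epoch_end after every epoch, passing metrics in the logs dict
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        print(f"epoch {epoch + 1}: val_loss = {logs.get('val_loss')}")

# Passed to training like any built-in callback, for example:
# model.fit(train_ds, validation_data=val_ds, epochs=3,
#           callbacks=[PrintValLossCallback(),
#                      tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)])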
A useful intuition: attention decides where to look (dynamic), while the embeddings determine what that means (static, learned at training time). Both are essential — attention alone with random weights predicts nothing useful, and embeddings without attention can't model sequential dependencies. See "Token and Positional Embeddings" above for details on how embeddings work.
The attention Q, K, V weight matrices are also learned weights, but they compute dynamic, context-dependent transformations — projecting embeddings into vectors that can be compared, with attention scores determining which Values get weighted together. The final prediction comes from the dense output layer projecting the last hidden state → vocabulary logits. The full chain of learned weights is:
→ Token Embedding (learned weights) → Positional embeddings (learned weights) → Attention (Q,K,V learned weight matrices) → FFN (learned weight matrices) → Output dense layer (learned weights) → Logits over 50,000 tokens
During generation, the model:
1. Encodes the input prompt into token IDs
2. Predicts a probability distribution over all possible next tokens
3. Applies temperature scaling: Controls randomness (lower = more deterministic, higher = more creative). For example, given the same logits for 4 candidate tokens:
   Temperature 0.5: [0.01, 0.02, 0.95, 0.02] ← concentrated, nearly deterministic
   Temperature 1.0: [0.05, 0.10, 0.70, 0.15] ← balanced
   Temperature 1.5: [0.12, 0.18, 0.42, 0.28] ← flattened, more creative/random
4. Samples the next token from the probability distribution
5. Updates the context window by sliding tokens left and adding the new token
6. Repeats until the desired length or an end token is reached
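A NumPy sketch of this loop, where predict_logits(token_ids) is a hypothetical stand-in for the trained model (this is not the generate.py implementation):

import numpy as np

def sample_next(logits, temperature=0.7):
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                          # softmax
    return np.random.choice(len(probs), p=probs)                  # sample from the distribution

def generate(prompt_ids, predict_logits, context_size=32, max_new_tokens=20):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = tokens[-context_size:]                # keep only the last context_size tokens
        next_id = sample_next(predict_logits(window))
        tokens.append(int(next_id))                    # slide the window forward by one token
    return tokens

# Toy stand-in for the model: random logits over a 10-token vocabulary
fake_model = lambda window: np.random.default_rng(len(window)).normal(size=10)
print(generate([3, 1, 4], fake_model))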
The sLAM models I've made are much smaller than production models like GPT* in most respects:
Arbitrary Parameter Count:
- sLAM: ~1-5M parameters (depending on vocab_size and d_model settings)
- GPT-2 small: 117M parameters
- GPT-3: 175B parameters
Model Dimensions:
- sLAM: 256 embedding dimensions, 4 transformer blocks, 4 attention heads
- GPT-2 small: 768 embedding dimensions, 12 transformer blocks, 12 attention heads
- GPT-3: 12,288 embedding dimensions, 96 transformer blocks, 96 attention heads
Context Window:
- sLAM: 32 tokens
- GPT-2: 1024 tokens
- GPT-3: 2,048 tokens
Training Data:
- sLAM: Thousands of text samples (megabytes)
- GPT-2: 40GB of text data
- GPT-3: 570GB of text data
Compute Requirements:
- sLAM: Trainable on consumer hardware (few GB RAM, optional GPU)
- GPT-2: 8 TPUs v2 (comparable to 32-64 NVIDIA V100s)
- GPT-3: Required thousands of high-end GPUs and months of training
One of the challenges in writing and running TensorFlow code is how many dependencies there are, and how quickly new versions replace old versions. To get all your versions aligned, start with your hardware, which may include a GPU. For example, if it's NVIDIA, what is the recommended version of CUDA? From that version, find the recommended version of TensorFlow or PyTorch. Then, for that package version, what version of Python? An example set of versions, working with an older NVIDIA GPU:
RTX 5000 + CUDA 11.8 + Tensorflow 2.17 + Python 3.8
Then all the other Python dependencies (e.g. pandas, numpy) will follow from the Python version.
Getting these versions aligned is critical: if the versions are out of alignment you can produce code that runs but fails with errors of various kinds that do not reference versions and are difficult to debug, like out-of-memory or data shape errors.
Containers may be available that package all the right versions, e.g. of CUDA, Python, and TensorFlow. In this example we're computing at the Texas Advanced Computing Center and downloading a TensorFlow container:
srun -N 1 -n 10 -p rtx-dev -t 60:00 --pty bash
module load tacc-apptainer
apptainer pull docker://tensorflow/tensorflow:2.17.0-gpu
Or just:
docker pull tensorflow/tensorflow:2.17.0-gpu
Then you can run your script with singularity:
singularity exec --nv tensorflow_2.17.0-gpu.sif python3 scripts/mnist_convnet.py