A 112k-parameter language model, written from scratch in ~150 lines of PyTorch,
trained on 1MB of Shakespeare. Every line is explained. It actually works.
A language model learns by playing a guessing game: given a chunk of text, predict the
very next character. Each time it guesses wrong, we nudge its weights. Do this a few
thousand times on Shakespeare and the model figures out — entirely on its own — that
plays have character names in ALL CAPS, followed by colons, followed by lines of dialogue.
No rules were programmed. It just saw the patterns in the data and compressed them into
its weights.
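The guessing game above can be sketched as a training loop. This is an illustrative toy, not the article's code: the "model" here is just an embedding table acting as a bigram predictor, and `data` stands in for the encoded Shakespeare text — but the shape of the loop (sample a chunk, predict the next character, measure with cross-entropy, nudge the weights) is the same.

```python
# Toy sketch of the guessing game: predict the next character,
# score the guess with cross-entropy, nudge the weights.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size = 65, 32
data = torch.randint(0, vocab_size, (1000,))  # stand-in for encoded text

# A deliberately tiny stand-in "model": an embedding table used as a
# bigram predictor, just to show the training loop's structure.
model = torch.nn.Embedding(vocab_size, vocab_size)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

for step in range(200):
    ix = torch.randint(0, len(data) - block_size - 1, (16,))
    x = torch.stack([data[i:i + block_size] for i in ix.tolist()])      # context
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix.tolist()])  # next chars
    logits = model(x)                       # (batch, time, vocab) guesses
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Swap the embedding table for a transformer and `data` for real text, and this is the whole training procedure.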
The whole model, on one page
Two transformer blocks, 64-dimensional embeddings, 4 attention heads. That's it.
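As a rough sketch of what those numbers imply (layer names and the context length are assumptions, not the article's exact code), an architecture with two blocks, 64-dim embeddings, and 4 heads lands at about 112k parameters:

```python
# A minimal sketch matching the stated numbers: 2 blocks, d=64, 4 heads.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide the future
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                        # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab=65, d=64, blocks=2, ctx=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)   # what each character "means"
        self.pos = nn.Embedding(ctx, d)     # where it sits in the context
        self.blocks = nn.Sequential(*[Block(d) for _ in range(blocks)])
        self.ln = nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab)     # scores for the next character

    def forward(self, idx):
        T = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        return self.head(self.ln(self.blocks(x)))
```

Counting `p.numel()` over the parameters of this sketch gives ~112k, which is where the headline number comes from.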
Drag the slider: watch the model learn
At step 0 the model has never seen a single character. By step 3000 it's writing something
that looks unmistakably like Shakespeare. Same prompt ("ROMEO:") at every checkpoint.
Training curve
Cross-entropy loss on the training set (fine dots) and validation set (bold line).
What the model attends to
Attention from block 1, averaged across heads. Each row = where that character "looked" when deciding what came next. Lower-triangular because the model can only see the past.
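A head-averaged map like the one above can be pulled straight out of `nn.MultiheadAttention`. The module and sizes here are illustrative, not the article's exact code:

```python
# Sketch: extract an attention map averaged across heads,
# the kind of matrix the visualization above is built from.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, heads, T = 64, 4, 8
attn = nn.MultiheadAttention(d, heads, batch_first=True)
x = torch.randn(1, T, d)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# need_weights=True returns the attention matrix; average_attn_weights=True
# averages it over the 4 heads, giving one (T, T) map per batch item.
_, w = attn(x, x, x, attn_mask=causal,
            need_weights=True, average_attn_weights=True)
w = w[0]  # (T, T): row i = where position i looked
```

Because the causal mask zeroes out future positions, `w` is lower-triangular, and each row sums to 1 thanks to the softmax.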
What to take away
A Transformer is ~150 lines of PyTorch. There's no magic — it's embeddings, attention, an MLP, and a residual. That's the whole recipe.
Attention is just weighted averages. For each position, compute a similarity score to every past position, softmax it, and average their values.
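That takeaway fits in a dozen lines. A single-head sketch with made-up sizes:

```python
# Attention as a weighted average: similarity scores, softmax, average.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 64
q = torch.randn(T, d)   # queries: "what am I looking for?"
k = torch.randn(T, d)   # keys:    "what do I contain?"
v = torch.randn(T, d)   # values:  "what do I pass along?"

scores = q @ k.T / d ** 0.5                       # similarity, scaled by sqrt(d)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # can't look at the future
weights = F.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ v                                 # weighted average of the values
```

Each row of `weights` is exactly one row of the triangular map in the visualization above.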
Scale is everything. This 112k-param model writes gibberish-Shakespeare. GPT-4 has ~10⁹× more parameters and writes essays. Same architecture.
Loss is not accuracy. PyTorch's cross-entropy is measured in nats, so a loss of 1.7 on a 65-char vocab means the model's guess carries ~1.7 / ln 2 ≈ 2.5 bits of uncertainty per character — far less than the log₂ 65 ≈ 6 bits of a uniform random guess.
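The arithmetic in that last takeaway is a two-liner:

```python
# Convert a nat-valued cross-entropy loss to bits and compare it
# with a uniform random guess over a 65-character vocabulary.
import math

loss_nats = 1.7
bits_per_char = loss_nats / math.log(2)   # nats -> bits
uniform_bits = math.log2(65)              # entropy of a uniform guess
```

`bits_per_char` comes out near 2.5 and `uniform_bits` near 6, so even this tiny model has squeezed out more than half the uncertainty.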